scalable/ai/agentic patterns/lesson 07 cph / /
lesson 07 / 07 · 15 min · updated ·

evals — knowing the pattern works

offline scorecards, trace-based checks, and the gap between demo and production.

the lesson every course skips

most material on agents stops at "the demo works." that's the easy part. evals are how you know the pattern holds at 100 conversations, at 10 000 conversations, on the weekend after a model update. without evals, every change is a prayer; with them, changes are measurements.

the word "evals" covers three distinct things that beginners often conflate: offline scorecards (run a fixed suite against a fixed agent), trace-based checks (assert on what happened in production traffic), and online experiments (compare two agent versions in front of real users). you want all three eventually. start with the first two.

offline scorecards — runnable

run a 3-case scorecard — each case has expectations the loop must satisfy
ready

three principles keep an offline suite useful:

trace-based invariants

assert an invariant over a live trace — catches bad behaviour post-hoc
ready

offline suites check the agent against known cases. trace-based checks do the opposite: they assert rules against live traffic. "never call issue_refund without a confirmation" is a structural invariant; if it's violated even once in a sample of traces, you have either a persona bug, a handoff bug, or a missing guardrail — and the trace tells you which.

run these continuously. the failures are your best source of new offline eval cases.

an eval budget is part of the build budget. if you're spending a week shipping a new persona, spend an afternoon writing evals for it. skipping this is the single most reliable predictor that the next person to touch the agent will break it and not know. every regression i've missed in a production agent traced back to a case the eval suite didn't have.

what the model-as-judge can and can't do

using an llm to grade an llm is cheap, scalable, and valuable — specifically for the checks that are subjective ("does this answer read as warm?"). it is also noisier than a deterministic assertion, and judge-model bias is real (judges tend to prefer answers that sound like them). the right split:

when it breaks

wrapping the course

the seven lessons, in one breath: skills are the reusable units of capability; extensions wrap the loop with everything around the core job; personas bundle identity, voice, and permission; handoffs pass control explicitly between agents; loops give the turn structure and budget; evals prove any of it holds up.

the frameworks will keep changing. the patterns, mostly, won't.

scalable labs·cvr 30091604·github·linkedin·hello@scalable.dk