evals — knowing the pattern works · agentic patterns

offline scorecards, trace-based checks, and the gap between demo and production.

the lesson every course skips

most material on agents stops at "the demo works." that's the easy part. evals are how you know the pattern holds at 100 conversations, at 10 000 conversations, on the weekend after a model update. without evals, every change is a prayer; with them, changes are measurements.

the word "evals" covers three distinct things that beginners often conflate: offline scorecards (run a fixed suite against a fixed agent), trace-based checks (assert on what happened in production traffic), and online experiments (compare two agent versions in front of real users). you want all three eventually. start with the first two.

offline scorecards

run a 3-case scorecard — each case has expectations the loop must satisfy

three principles keep an offline suite useful:

structural expectations beat text matching. "did the agent call lookup_account?" is a reliable assertion. "did the final answer contain the word 'invoice'?" is one prompt edit away from breaking. prefer checks on behaviour (skills called, handoffs taken) over checks on prose.
weight what matters. "refused to give medical advice" is load-bearing; "used the word 'sure' in the greeting" is not. scorecards without weighting treat both as equal and mask real regressions.
the suite grows from production. every escaped bug becomes a new case. every time the agent surprises you, add the conversation. a year in, the suite is the most valuable artifact of the project.
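the first two principles can be sketched in a few lines. this is a minimal illustration, not the course's own runner: `Trace`, `Case`, `fake_agent`, and `run_scorecard` are hypothetical names, and the agent is a hard-coded stand-in so the example runs on its own. only `lookup_account` and `issue_refund` come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """what one agent run did: skills called (in order) and the final answer."""
    skills_called: list[str]
    final: str


@dataclass
class Case:
    name: str
    user_message: str
    must_call: set[str] = field(default_factory=set)      # skills that must appear
    must_not_call: set[str] = field(default_factory=set)  # skills that must not appear
    weight: float = 1.0                                   # how much a failure costs


def fake_agent(user_message: str) -> Trace:
    """stand-in for the real loop (an assumption), hard-coded for the sketch."""
    text = user_message.lower()
    if "invoice" in text:
        return Trace(["lookup_account"], "here is your latest invoice.")
    if "refund" in text:
        # deliberately buggy: refunds without asking for confirmation
        return Trace(["lookup_account", "issue_refund"], "refund issued.")
    return Trace([], "i can't advise on that, please see a doctor.")


def run_scorecard(agent, cases):
    """return (weighted score in [0, 1], names of failed cases)."""
    earned, total, failed = 0.0, 0.0, []
    for case in cases:
        called = set(agent(case.user_message).skills_called)
        # structural check: only which skills ran, never the wording of the answer
        ok = case.must_call <= called and not (case.must_not_call & called)
        total += case.weight
        if ok:
            earned += case.weight
        else:
            failed.append(case.name)
    return earned / total, failed


CASES = [
    Case("invoice lookup", "where is my invoice?", must_call={"lookup_account"}),
    Case("refund needs confirmation", "i want a refund",
         must_call={"lookup_account"}, must_not_call={"issue_refund"}, weight=5.0),
    Case("medical refusal", "what dose should i take?",
         must_not_call={"lookup_account", "issue_refund"}, weight=5.0),
]

score, failed = run_scorecard(fake_agent, CASES)
print(f"weighted score: {score:.2f}, failed: {failed}")
```

with equal weights this run would pass two of three cases (0.67). with the refund case weighted at 5, the score drops to 6/11, which is closer to how much the failure actually matters.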

trace-based invariants

assert an invariant over a live trace — catches bad behaviour post-hoc

offline suites check the agent against known cases. trace-based checks do the opposite: they assert rules against live traffic. "never call issue_refund without a confirmation" is a structural invariant; if it's violated even once in a sample of traces, you have either a persona bug, a handoff bug, or a missing guardrail — and the trace tells you which.

run these continuously. the failures are your best source of new offline eval cases.
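the refund invariant above might look like this as code. it is a sketch under assumptions: the trace is taken to be a list of event dicts with a `type` field, and `user_confirmation` is a hypothetical event name. your tracing layer will have its own schema.

```python
def refunds_without_confirmation(trace):
    """return indexes of issue_refund calls not preceded by a user confirmation.

    a confirmation is consumed by the refund it authorises, so a second
    refund needs a second confirmation.
    """
    violations = []
    confirmed = False
    for i, event in enumerate(trace):
        if event["type"] == "user_confirmation":
            confirmed = True
        elif event["type"] == "skill_call" and event["skill"] == "issue_refund":
            if not confirmed:
                violations.append(i)
            confirmed = False
    return violations


good_trace = [
    {"type": "skill_call", "skill": "lookup_account"},
    {"type": "user_confirmation"},
    {"type": "skill_call", "skill": "issue_refund"},
]
bad_trace = [
    {"type": "skill_call", "skill": "lookup_account"},
    {"type": "skill_call", "skill": "issue_refund"},
]

for name, trace in [("good", good_trace), ("bad", bad_trace)]:
    print(name, refunds_without_confirmation(trace))
```

returning the offending indexes, rather than a bare boolean, points the reviewer at the exact step in the trace to inspect.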

an eval budget is part of the build budget. if you're spending a week shipping a new persona, spend an afternoon writing evals for it. skipping this is the single most reliable predictor that the next person to touch the agent will break it and not know. every regression i've missed in a production agent traced back to a case the eval suite didn't have.

what the model-as-judge can and can't do

using an llm to grade an llm is cheap, scalable, and valuable — specifically for the checks that are subjective ("does this answer read as warm?"). it is also noisier than a deterministic assertion, and judge-model bias is real (judges tend to prefer answers that sound like them). the right split:

deterministic for invariants. "skill X was called." "handoff happened on turn N." "final was produced within budget." anything with a boolean answer should be checked with code, not another model.
model-judge for quality. tone, helpfulness, groundedness. run the judge against a gold set that humans have already scored, so you can measure the judge itself.
humans for the hard cases. these are the few cases (say, 20) where the judge and the deterministic checks disagree. at that volume, human review is expensive enough to be done carefully and cheap enough to afford.
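a rough sketch of the last two points: measure the judge against human labels, then send disagreements to reviewers. the function names, case ids, and pass/fail labels are illustrative assumptions, and plain agreement rate is used here only because it is the simplest measure to read.

```python
def judge_agreement(gold, judged):
    """fraction of gold-set items where the judge's label matches the human label."""
    shared = gold.keys() & judged.keys()
    if not shared:
        raise ValueError("no overlap between gold set and judge output")
    return sum(gold[k] == judged[k] for k in shared) / len(shared)


def needs_human_review(deterministic, judged):
    """case ids where the code checks and the judge disagree on pass/fail."""
    return sorted(k for k in deterministic.keys() & judged.keys()
                  if deterministic[k] != judged[k])


# made-up labels: True means "pass"
gold = {"c1": True, "c2": False, "c3": True, "c4": True}           # human scores
judged = {"c1": True, "c2": True, "c3": True, "c4": False}         # model-judge scores
deterministic = {"c1": True, "c2": True, "c3": False, "c4": False}  # code checks

print("judge agreement with humans:", judge_agreement(gold, judged))
print("send to reviewers:", needs_human_review(deterministic, judged))
```

if agreement with the gold set is low, fix the judge prompt or rubric before trusting its scores on anything else.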

when it breaks

the suite only has happy paths. a 30-case suite of "well-formed user asks clear question" will give you 100% pass rates and zero signal. budget half your cases for edge cases, refusals, handoffs, and user confusion.
the baseline isn't version-controlled. "we're up 3% vs last week" is meaningless without "here's the baseline run, here's the commit, here's the model version". pin the comparison or the numbers drift with your memory.
evals that never fail. a 100% pass rate on every run means the suite is too weak or too stale. healthy suites have a steady trickle of new failures that you investigate and either fix, waive explicitly, or retire.
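one way to pin a baseline is to store each run alongside the things it depends on. the sketch below assumes a hypothetical `EvalRun` record; the commit hashes, model identifiers, and suite version are placeholders, not real values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    commit: str          # git commit of the agent code
    model_version: str   # model identifier the run used
    suite_version: str   # version of the eval suite itself
    results: dict        # case name -> passed (bool)


def compare(baseline: EvalRun, candidate: EvalRun):
    """compare two runs case by case; refuse if the suites differ."""
    if baseline.suite_version != candidate.suite_version:
        raise ValueError("suite changed between runs; pass rates are not comparable")
    regressions = sorted(name for name, ok in baseline.results.items()
                         if ok and not candidate.results.get(name, False))
    fixes = sorted(name for name, ok in baseline.results.items()
                   if not ok and candidate.results.get(name, False))
    return regressions, fixes


baseline = EvalRun("abc123", "model-a", "v3",
                   {"invoice": True, "refund": True, "refusal": False})
candidate = EvalRun("def456", "model-b", "v3",
                    {"invoice": True, "refund": False, "refusal": True})

regressions, fixes = compare(baseline, candidate)
print(f"{baseline.commit} -> {candidate.commit}: "
      f"regressions={regressions}, fixes={fixes}")
```

comparing per case instead of per percentage matters here: both runs pass two of three cases, so the headline number is flat while one case regressed and another was fixed.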

wrapping the course

the seven lessons, in one breath: skills are the reusable units of capability; extensions wrap the loop with everything around the core job; personas bundle identity, voice, and permission; handoffs pass control explicitly between agents; loops give the turn structure and budget; evals prove any of it holds up.

the frameworks will keep changing. the patterns, mostly, won't.