the lesson every course skips
most material on agents stops at "the demo works." that's the easy part. evals are how you know the pattern holds at 100 conversations, at 10 000 conversations, on the weekend after a model update. without evals, every change is a prayer; with them, changes are measurements.
the word "evals" covers three distinct things that beginners often conflate: offline scorecards (run a fixed suite against a fixed agent), trace-based checks (assert on what happened in production traffic), and online experiments (compare two agent versions in front of real users). you want all three eventually. start with the first two.
offline scorecards — runnable
three principles keep an offline suite useful:
- structural expectations beat text matching. "did the agent call lookup_account?" is a reliable assertion. "did the final answer contain the word 'invoice'?" is a prompt away from breaking. prefer checks on behaviour (skills called, handoffs taken) over checks on prose.
- weight what matters. "refused to give medical advice" is load-bearing; "used the word 'sure' in the greeting" is not. scorecards without weighting treat both as equal and mask real regressions.
- the suite grows from production. every escaped bug becomes a new case. every time the agent surprises you, add the conversation. a year in, the suite is the most valuable artifact of the project.
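the first two principles can be sketched in a few lines of python. this is a minimal illustration, not any framework's api: `RunResult`, `Check`, `score`, and the weights are all hypothetical names and values chosen for the example. it shows structural checks (skills called, handoffs taken) carrying most of the weight, with a prose check weighted low.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """What one agent run produced (hypothetical shape, assumed for this sketch)."""
    skills_called: list[str]
    handoffs: list[str]
    final_text: str


@dataclass
class Check:
    name: str
    weight: float  # how much this check matters relative to the others
    passed: Callable[[RunResult], bool]


def score(result: RunResult, checks: list[Check]) -> float:
    """Weighted fraction of checks that passed, between 0.0 and 1.0."""
    total = sum(c.weight for c in checks)
    earned = sum(c.weight for c in checks if c.passed(result))
    return earned / total


checks = [
    # Structural checks: assert on behaviour, not wording.
    Check("called lookup_account", 5.0, lambda r: "lookup_account" in r.skills_called),
    Check("handed off to billing", 3.0, lambda r: "billing" in r.handoffs),
    # Prose check: brittle, so it gets a low weight.
    Check("greeting says 'sure'", 0.5, lambda r: "sure" in r.final_text.lower()),
]

result = RunResult(
    skills_called=["lookup_account"],
    handoffs=["billing"],
    final_text="Your invoice is attached.",
)
print(f"{score(result, checks):.2f}")  # 0.94: only the low-weight prose check failed
```

with equal weights the same run would score 0.67, which makes a missed greeting look as serious as a missed skill call.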
trace-based invariants
offline suites check the agent against known cases. trace-based checks do the opposite: they assert rules against live traffic. "never call issue_refund without a confirmation" is a structural invariant; if it's violated even once in a sample of traces, you have either a persona bug, a handoff bug, or a missing guardrail, and the trace tells you which.
run these continuously. the failures are your best source of new offline eval cases.
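the refund invariant above might look like this as code. it is a sketch under assumptions: the `Event` shape and the event kinds `"skill_call"` and `"user_confirmation"` are invented for the example, and a real tracing system will have its own schema. the sketch also assumes each refund needs its own confirmation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Event:
    kind: str       # assumed kinds: "skill_call", "user_confirmation"
    name: str = ""  # skill name, when kind == "skill_call"


def refund_without_confirmation(trace: list[Event]) -> bool:
    """True if issue_refund was called with no confirmation before it."""
    confirmed = False
    for event in trace:
        if event.kind == "user_confirmation":
            confirmed = True
        elif event.kind == "skill_call" and event.name == "issue_refund":
            if not confirmed:
                return True
            confirmed = False  # assumption: each refund needs a fresh confirmation
    return False


def violations(traces: dict[str, list[Event]]) -> list[str]:
    """IDs of the sampled traces that break the invariant."""
    return [tid for tid, trace in traces.items() if refund_without_confirmation(trace)]


# Two hand-written traces, for illustration only.
traces = {
    "t1": [
        Event("skill_call", "lookup_account"),
        Event("user_confirmation"),
        Event("skill_call", "issue_refund"),
    ],
    "t2": [
        Event("skill_call", "lookup_account"),
        Event("skill_call", "issue_refund"),
    ],
}
print(violations(traces))  # ['t2']
```

each id returned here is a candidate for a new offline case: replay the conversation, pin the expected behaviour, and add it to the suite.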
what the model-as-judge can and can't do
using an llm to grade an llm is cheap, scalable, and valuable — specifically for the checks that are subjective ("does this answer read as warm?"). it is also noisier than a deterministic assertion, and judge-model bias is real (judges tend to prefer answers that sound like them). the right split:
- deterministic for invariants. "skill X was called." "handoff happened on turn N." "final was produced within budget." anything with a boolean answer should be checked with code, not another model.
- model-judge for quality. tone, helpfulness, groundedness. run the judge against a gold set that humans have already scored, so you can measure the judge itself.
- humans for the hard cases. the 20 cases where the judge and the deterministic checks disagree. at that volume, reviewers are expensive enough to be careful and cheap enough to afford.
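"measure the judge itself" can be made concrete by comparing judge labels with human labels on the gold set. the text does not name a metric; the sketch below uses raw agreement plus cohen's kappa, one common choice that corrects for agreement expected by chance. the labels are made-up illustration data, not real results.

```python
from __future__ import annotations

from collections import Counter


def agreement(gold: list[str], judge: list[str]) -> float:
    """Fraction of cases where the judge's label matches the human label."""
    if len(gold) != len(judge) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(g == j for g, j in zip(gold, judge)) / len(gold)


def cohens_kappa(gold: list[str], judge: list[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(gold)
    observed = agreement(gold, judge)
    gold_counts, judge_counts = Counter(gold), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(gold_counts[label] * judge_counts[label] for label in gold_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: human scores vs. judge scores on eight cases.
gold = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

print(f"agreement = {agreement(gold, judge):.2f}")  # 0.75
print(f"kappa     = {cohens_kappa(gold, judge):.2f}")  # 0.47
```

the gap between the two numbers is the point: 75% raw agreement sounds fine, but once chance is removed the judge agrees with the humans only moderately on this toy set.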
when it breaks
- the suite only has happy paths. a 30-case suite of "well-formed user asks clear question" will give you 100% pass rates and zero signal. budget half your cases for edge cases, refusals, handoffs, and user confusion.
- the baseline isn't version-controlled. "we're up 3% vs last week" is meaningless without "here's the baseline run, here's the commit, here's the model version". pin the comparison or the numbers drift with your memory.
- evals that never fail. a 100% pass rate on every run means the suite is too weak or too stale. healthy suites have a steady trickle of new failures that you investigate and either fix, waive explicitly, or retire.
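pinning the baseline, as the second bullet asks, can be as simple as storing each run with the metadata needed to reproduce it. a minimal sketch, assuming a hypothetical `EvalRun` record; the commit hashes, model version, and suite version below are placeholders, not real identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """One scorecard run plus everything needed to reproduce it (hypothetical record)."""
    commit: str
    model_version: str
    suite_version: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def compare(baseline: EvalRun, candidate: EvalRun) -> str:
    """Report the pass-rate change, refusing to compare across different suites."""
    if baseline.suite_version != candidate.suite_version:
        raise ValueError("suite changed between runs; re-run the baseline before comparing")
    delta = candidate.pass_rate - baseline.pass_rate
    return f"{delta:+.1%} vs baseline {baseline.commit} ({baseline.model_version})"


# Placeholder values for illustration.
baseline = EvalRun("a1b2c3d", "model-2024-01", "suite-v12", passed=84, total=100)
candidate = EvalRun("e4f5a6b", "model-2024-01", "suite-v12", passed=87, total=100)

print(compare(baseline, candidate))  # +3.0% vs baseline a1b2c3d (model-2024-01)
```

in practice the baseline record would be serialized and committed alongside the suite, so "up 3%" always names the run it is measured against.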
wrapping the course
the seven lessons, in one breath: skills are the reusable units of capability; extensions wrap the loop with everything around the core job; personas bundle identity, voice, and permission; handoffs pass control explicitly between agents; loops give the turn structure and budget; evals prove any of it holds up.
the frameworks will keep changing. the patterns, mostly, won't.