eval-driven prompt iteration

the loop most teams run on prompts: change a phrase, eyeball a few outputs, ship. the regression two weeks later: the new prompt fixed one case and silently broke three others.

the loop that works

write the cases first. each case is an input + a property that should hold of the output. the property can be loose (“the output mentions a refund policy”) or tight (“the output is valid json with these keys”). either way it’s a function that returns true or false. you have a test suite. this is the same idea hamel husain has been writing about for years — see his field guide to evals — and it’s the only practice i’ve seen change prompt quality at scale.

once you have ten cases, you stop iterating on vibes. each prompt change runs against the suite. you see which cases pass and which fail. you can a/b prompts the way you a/b code.

write the failing case first

the move that took me longest to learn: write the failing case before you change the prompt. if you change the prompt and then write the case, you’ve encoded what the new prompt happens to do — not what you actually wanted. the case anchors the intent. the prompt fits to the case.

a starter setup, in typescript:

type Case = {
  name: string;
  input: string;
  check: (output: string) => boolean;
};

const cases: Case[] = [/* ... */];

for (const c of cases) {
  const output = await runAgent(c.input);
  console.log(c.name, c.check(output) ? "ok" : "fail");
}

that’s it. no framework. you can wrap it in vitest later if you want ci. the value isn’t in the framework, it’s in the discipline of writing the case before the change.

llm-as-judge is the next step up — when “the output is correct” is too subjective for a regex, you have a smaller, cheaper model score the answer against a rubric. but the same shape applies: cases, properties, run, score.

if you’re going to ship prompts, you have to ship them with their evals. otherwise you’re just hoping. agentic patterns covers this in more depth, and spec-kit takes the same idea one grain coarser — the spec becomes the failing case, written before the implementation.