synthetic apis are the substrate, not the scaffolding

an agent is an llm calling tools in a loop. that’s the whole mechanism: the model reads context, decides to invoke a tool, gets a result, and loops on that feedback until it’s done. anthropic’s framing is deliberately this plain, and it’s worth taking literally.

a tool, from the model’s side, is almost nothing. a name, and a json schema for its inputs. in mcp it’s tools/list for discovery and tools/call for invocation; in the provider apis it’s “function calling.” either way the model never sees your implementation. it sees the schema you advertised and the result you returned. nothing else crosses the boundary.
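to make "almost nothing" concrete, here is a minimal sketch of one advertised tool in the shape mcp's tools/list uses (name, description, inputSchema), plus a tiny dispatcher. the tool name, its fields, the callTool helper, and the mockBackend object are all illustrative assumptions, not the bank api's actual definitions.

```javascript
// hypothetical tool definition; only the field layout follows mcp's tools/list
const getTransactionsTool = {
  name: "get_transactions",
  description: "List transactions for an account within a date range.",
  inputSchema: {
    type: "object",
    properties: {
      account_id: { type: "string" },
      from: { type: "string", description: "ISO date, inclusive" },
      to: { type: "string", description: "ISO date, exclusive" },
    },
    required: ["account_id"],
  },
};

// the model only ever sees the schema above and the value returned here.
// `backend` can be a mock or a real client; nothing observable changes.
function callTool(backend, name, args) {
  if (name !== getTransactionsTool.name) {
    return { isError: true, content: [{ type: "text", text: `unknown tool: ${name}` }] };
  }
  const rows = backend.transactions(args);
  return { isError: false, content: [{ type: "text", text: JSON.stringify(rows) }] };
}

// a stand-in backend that honors the same contract
const mockBackend = {
  transactions: ({ account_id }) => [{ id: "t1", account_id, amount: -120 }],
};
```

calling callTool(mockBackend, "get_transactions", { account_id: "a1" }) returns a text result the model can read; swapping in a real client behind the same function would look identical from its side.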

which means a mock that honors the same contract is, from the agent’s perspective, indistinguishable from the real backend. that’s not a limitation to work around. it’s the leverage. you can build and evaluate the entire agent against a synthetic backend before a real one exists — and, more usefully, keep evaluating against it long after the real one ships.

when i wired up the agentic chat over a synthetic bank api, the n8n tools agent saw /accounts, /transactions, /aggregate, /recurring, /statements as five tool schemas. it could not tell that behind them was a mulberry32-seeded express service rather than a real core-banking system. that’s the thesis in one sentence: the agent can’t tell the difference, and that’s the point.

the boring case first

most of this isn’t new. mock-from-contract is a mature pattern: write the openapi document, stand up a mock server from it (prism, mockoon, wiremock), and build the client against spec-conformant responses while the backend gets implemented behind the same interface. it decouples the consumer from backend availability and lets two teams work in parallel against one agreed shape. that’s the headline use case for prism and postman’s mock servers, and it predates agents by a decade.

the part worth keeping for what follows is failure-path testing. a mock returns a 500, an empty list, a malformed payload, a 429, a connection reset — on demand, deterministically. wiremock documents exactly these as fault types (EMPTY_RESPONSE, MALFORMED_RESPONSE_CHUNK, CONNECTION_RESET_BY_PEER, configurable delays). triggering those reliably against a real backend ranges from hard to impossible. against a mock it’s a flag.

this is just the test pyramid argument — end-to-end tests are slow and flaky, so you isolate behind test doubles. a synthetic api is that doubling applied to an llm’s tool surface. the agent twist is that the thing you’re isolating isn’t a flaky network. it’s the model.

why it compounds for agents

here’s the part specific to agents. you cannot make the model deterministic.

the intuition is that temperature: 0 gets you there. it doesn’t. greedy decoding removes the sampling randomness, but on hosted, dynamically-batched endpoints the same prompt still yields different tokens across runs — because the gpu kernels are batch-size dependent and floating-point addition isn’t associative, so the numerics shift with whatever else happened to be in the batch. thinking machines walked through this in detail: it’s reducible in principle with batch-invariant kernels, at a real throughput cost, but it is almost never reduced on the shared endpoints you actually call. so for evaluation, treat the model as non-deterministic and stop fighting it.
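the non-associativity claim is easy to check locally. a minimal sketch, in plain doubles rather than gpu kernels: the same three numbers summed under two groupings give different results. groupedSum is a hypothetical helper that stands in for a reduction whose partial sums depend on how the work was split.

```javascript
// same three numbers, two groupings
const leftFirst = (0.1 + 0.2) + 0.3;
const rightFirst = 0.1 + (0.2 + 0.3);
const identical = leftFirst === rightFirst; // false under IEEE 754 doubles

// sum each group to a partial, then add the partials in order.
// different splits of the same values mean different additions get rounded.
function groupedSum(groups) {
  let total = 0;
  for (const group of groups) {
    let partial = 0;
    for (const value of group) partial += value;
    total += partial;
  }
  return total;
}

const oneGroup = groupedSum([[0.1, 0.2, 0.3]]);
const twoGroups = groupedSum([[0.1], [0.2, 0.3]]);
```

oneGroup and twoGroups hold the same values in the same order and still disagree, which is the effect a batch-dependent reduction has on logits, just at a much smaller scale.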

if one half of the loop is irreducibly noisy, the only way to get a stable signal is to pin the other half completely. make the environment deterministic and the model becomes the single moving part. then a difference between two runs is attributable — it’s the model, because nothing else changed.

  the only moving part          pinned every run
  ┌───────────────────┐         ┌───────────────────────┐
  │ llm               │──call──►│ synthetic api         │
  │ non-deterministic │         │ seed=42, clock frozen │
  │ even at temp 0    │◄─reply──│ → identical every run │
  └───────────────────┘         └───────────────────────┘
     │   ▲
     └───┘  loop until done

for the bank api, the two knobs are a fixed mulberry32 seed and a NOW=2026-06-01 override so “last month” resolves the same way every run. that’s it. the same prompt against the same seeded mock produces a comparable trace every time, which is the only substrate an eval can stand on. this isolates the tool-use signal. it doesn’t replace pinning the model version and averaging across rollouts; it’s what makes those numbers mean something.
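a minimal sketch of those two knobs, assuming the widely circulated mulberry32 routine and a clock that takes an override string. makeClock and lastMonthRange are hypothetical helper names, and the half-open utc range is one reasonable reading of "last month", not necessarily the one the service uses.

```javascript
// mulberry32: a small 32-bit seeded prng; same seed, same sequence
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// hypothetical clock: frozen when an override is given, wall clock otherwise
function makeClock(nowOverride) {
  const fixed = nowOverride ? new Date(nowOverride) : null;
  return () => (fixed ? new Date(fixed.getTime()) : new Date());
}

// "last month" as a half-open [from, to) range of utc dates
function lastMonthRange(now) {
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth();
  const from = new Date(Date.UTC(year, month - 1, 1)); // month -1 rolls back into december
  const to = new Date(Date.UTC(year, month, 1));
  return { from: from.toISOString().slice(0, 10), to: to.toISOString().slice(0, 10) };
}

const clock = makeClock("2026-06-01");
const range = lastMonthRange(clock()); // { from: "2026-05-01", to: "2026-06-01" }
```

in the service the override would typically come from process.env.NOW; it is passed in explicitly here so the sketch has no hidden inputs.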

grade the end state, not the path

once the environment is deterministic, the eval design that actually works falls out of it: run each trial in an environment reset to a clean state, grade what the agent produced rather than the exact path it took, and test both the cases where a behavior should occur and where it should not.

sierra’s tau-bench is the cleanest proof. it puts simulated domain apis over a resettable database and scores a run by hashing the final database state against an annotated goal state — no human grader, no model-as-judge. that only works because the environment is deterministic and state-comparable. it’s also what let them measure reliability honestly: their pass^k metric (does the agent succeed on all k retries of the same task) showed frontier function-calling agents are wildly inconsistent — under 25% pass^8 in the retail domain. you can only see that when the world holds still.
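pass^k is cheap to compute once you have per-task trial counts. a minimal sketch, assuming the usual combinatorial estimator: for a task with n trials and c successes, the chance that k trials drawn without replacement all succeed is (c choose k) / (n choose k), averaged over tasks. the function names and the input shape are hypothetical.

```javascript
// binomial coefficient for small n, built up multiplicatively
function choose(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i; // stays an integer at every step
  }
  return Math.round(result);
}

// results: one { trials, successes } entry per task
function passHatK(results, k) {
  const perTask = results.map(({ trials, successes }) => {
    if (trials < k) throw new Error(`need at least ${k} trials per task`);
    return choose(successes, k) / choose(trials, k);
  });
  return perTask.reduce((sum, p) => sum + p, 0) / perTask.length;
}

// a task that passes 7 of 8 trials scores 0.875 at k=1 and 0 at k=8
const atK1 = passHatK([{ trials: 8, successes: 7 }], 1);
const atK8 = passHatK([{ trials: 8, successes: 7 }], 8);
```

the gap between atK1 and atK8 is the whole point of the metric: one flaky trial barely dents the average and wipes out the all-k score.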

two things follow for your own evals:

reset per trial, or your scores lie. leftover state from a previous run leaks into the next one, causes correlated failures, and inflates the pass rate. a re-seedable mock gives every case its own clean room.
assert on outcome, not trajectory. check that the agent left the bank’s state in the expected configuration after a “move 500 kr to savings” request, not that it called tools in one specific order. trajectory matching punishes valid alternative orderings, and there are usually several.

// grade the end state, not the path
resetMock({ seed: 42, now: "2026-06-01" });
await agent.run("move 500 kr from everyday to savings");

expect(mock.state()).toEqual(goalState);   // outcome
// not: expect(toolCalls).toEqual(["get_accounts", "transfer"]);

and because cases are cheap against a mock, add the negative ones: a plain “hi” should call no tool at all. n8n’s own evaluations ship a deterministic “tools used” metric that asserts exactly which tools fired and how many times — no judge model required. that’s the eval-driven loop applied one level up, at the agent instead of the prompt.
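a "tools used" check of that kind fits in a few lines. this is a sketch of the idea, not n8n's implementation: the trace shape (steps with type and name) and both function names are assumptions. it compares call counts per tool and ignores order, so it stays compatible with grading outcomes rather than trajectories.

```javascript
// count tool calls per tool name in a recorded trace
function toolsUsed(trace) {
  const counts = {};
  for (const step of trace) {
    if (step.type === "tool_call") {
      counts[step.name] = (counts[step.name] || 0) + 1;
    }
  }
  return counts;
}

// true when every tool fired exactly the expected number of times
function matchesExpectedTools(trace, expected) {
  const actual = toolsUsed(trace);
  const names = new Set([...Object.keys(actual), ...Object.keys(expected)]);
  for (const name of names) {
    if ((actual[name] || 0) !== (expected[name] || 0)) return false;
  }
  return true;
}

// negative case: a greeting should produce no tool calls at all
const greetingTrace = [{ type: "message", text: "hi! how can i help?" }];
const greetingOk = matchesExpectedTools(greetingTrace, {});
```

the empty expectation object is the negative case: any tool call at all makes it fail.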

don’t let a dev-loop agent send real money

sandboxing is the use case production literally cannot give you. anthropic’s guidance for agents that take actions is “extensive testing in sandboxed environments, along with the appropriate guardrails.” mcp lets you annotate a tool with readOnlyHint, destructiveHint, idempotentHint — but the spec is explicit that these are informational hints, not enforcement. a hint doesn’t stop a looping agent from calling /transfer. the mock does. the synthetic backend is the actual enforcement boundary, which strengthens the case for it rather than weakening it.

a synthetic /transfer that records intent but moves no money lets you exercise the dangerous branch hundreds of times in regression with zero blast radius. you would never let a half-built agent hammer a real send-money endpoint in a loop. against the mock you can, and you should, because that’s the branch most likely to be wrong.
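a minimal sketch of what "records intent but moves no money" can look like, written as a plain in-memory handler rather than an express route so it runs on its own. createSyntheticBank, the account names, and the status codes are illustrative assumptions. every attempt is logged, and balances only ever change inside this object.

```javascript
// hypothetical in-memory bank: every attempt is recorded, nothing leaves the process
function createSyntheticBank(initialBalances) {
  const balances = { ...initialBalances };
  const intents = [];

  function respond(request, status, body) {
    intents.push({ ...request, status }); // the audit trail the eval inspects
    return { status, body };
  }

  function transfer({ from, to, amount }) {
    const request = { from, to, amount };
    if (!(from in balances) || !(to in balances)) {
      return respond(request, 404, { error: "unknown_account" });
    }
    if (typeof amount !== "number" || !(amount > 0)) {
      return respond(request, 400, { error: "invalid_amount" });
    }
    if (balances[from] < amount) {
      return respond(request, 409, { error: "insufficient_funds" });
    }
    balances[from] -= amount;
    balances[to] += amount;
    return respond(request, 200, { transfer_id: `synthetic-${intents.length + 1}` });
  }

  // snapshot copies so a grader cannot mutate the bank by accident
  const state = () => ({ balances: { ...balances }, intents: intents.map((i) => ({ ...i })) });
  return { transfer, state };
}

// a looping agent can hammer this as often as it likes
const bank = createSyntheticBank({ everyday: 1200, savings: 0 });
const attempts = [1, 2, 3].map(() => bank.transfer({ from: "everyday", to: "savings", amount: 500 }));
```

the third attempt is rejected for insufficient funds and still shows up in state().intents, so a regression test can see exactly how far a runaway loop got.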

the same flag-flip turns rare failures into permanent regression cases. agents loop — i flagged this in the oauth work too, where a runaway agent is its own rate-limiting problem — so the recovery path matters:

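// mock-bank-api: one env flag turns a rare failure into a fixture
const express = require("express");
const app = express();

app.get("/transactions", (req, res) => {
  // FAULT=rate_limit forces the 429 path on every request
  if (process.env.FAULT === "rate_limit") {
    res.set("Retry-After", "2"); // standard header, mirrors retry_after in the body
    return res.status(429).json({ error: "rate_limited", retry_after: 2 });
  }
  res.json(query(req.query)); // query(): the service's seeded lookup, defined elsewhere
});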

now “what does the agent do when /transactions 429s mid-conversation” is a deterministic test, not a thing you hit by luck once a month. the 429 comes back as a structured error the model can actually reason about, and what you assert against is the agent’s reaction to it: did it back off and retry, or did it invent a balance to paper over the gap. against a live api you can’t make that failure happen on command. against a mock it’s a line.
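the "did it retry" half of that assertion can be graded mechanically from the recorded trace. a minimal sketch, assuming a hypothetical trace of tool_result steps that carry the tool name and http status. detecting an invented balance needs a check on the final answer and is deliberately left out here.

```javascript
// true when a rate-limited tool was called again later and succeeded
function recoveredFromRateLimit(trace, toolName) {
  const results = trace.filter((s) => s.type === "tool_result" && s.name === toolName);
  const firstLimited = results.findIndex((r) => r.status === 429);
  if (firstLimited === -1) return true; // never rate limited, nothing to recover from
  return results.slice(firstLimited + 1).some((r) => r.status === 200);
}

// the agent backed off and retried
const retried = [
  { type: "tool_result", name: "get_transactions", status: 429 },
  { type: "tool_result", name: "get_transactions", status: 200 },
];

// the agent gave up after the 429 and answered anyway
const gaveUp = [
  { type: "tool_result", name: "get_transactions", status: 429 },
  { type: "message", text: "your balance is 4,200 kr" },
];
```

recoveredFromRateLimit(retried, "get_transactions") passes and the gaveUp trace fails, which is the regression signal you want pinned.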

the data layer

the determinism has to reach the data, not just the routes. the simplest tier is a seeded faker: faker.seed(42) initializes a mersenne twister so the same seed plus the same sequence of calls yields byte-identical fixtures. good enough to make tool responses reproducible across eval runs.

the trap — and it’s load-bearing — is that faker’s output depends on both the seed and the number of calls made since you set it. insert one call upstream and every value after it silently shifts:

const { faker } = require("@faker-js/faker");

faker.seed(42);
const name = faker.person.fullName();    // add this line later...
const balance = faker.finance.amount();  // ...and this value changes, same seed

so a refactor that adds one field quietly breaks a snapshot the seed was supposed to protect. seeds also aren’t stable across faker major versions — the v9 upgrade guide says outright that seeded snapshots have to be regenerated. determinism that’s fragile to refactors is a maintenance liability dressed up as a guarantee.

a stateless mapper fixes the fragility. supabase’s copycat hashes each input (siphash under the hood) so a given input always maps to the same output regardless of call order or environment. customer_id 1001 resolves to the same fake name in /accounts, /transactions, and /statements with no lookup table — referential integrity for free, and pii pseudonymization as a side effect. (pseudonymization, not anonymization: the readme is honest that a known input space is dictionary-attackable, mitigated by a hash key.)
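the idea is small enough to sketch without the library. this is not copycat's api: it swaps in hmac-sha256 from node's crypto module where copycat uses siphash, and the name lists, the key, and the function names are all made up for illustration. what it shows is the property that matters: output depends only on the input and the key, never on call order.

```javascript
const { createHmac } = require("node:crypto");

// tiny illustrative pools; a real mapper would draw from much larger ones
const FIRST_NAMES = ["Astrid", "Bjorn", "Clara", "Dag", "Elin", "Frode"];
const LAST_NAMES = ["Berg", "Dahl", "Holm", "Lund", "Moen", "Strand"];

// hash (salt, input) under a key and use the first 4 bytes as an index
function pick(list, key, salt, input) {
  const digest = createHmac("sha256", key).update(`${salt}:${input}`).digest();
  return list[digest.readUInt32BE(0) % list.length];
}

// same customer id and key always give the same name, in any route, in any order
function fakeName(customerId, key = "dev-only-key") {
  const first = pick(FIRST_NAMES, key, "first", customerId);
  const last = pick(LAST_NAMES, key, "last", customerId);
  return `${first} ${last}`;
}

const inAccounts = fakeName(1001);
fakeName(7); // unrelated calls in between change nothing
const inStatements = fakeName(1001);
```

because there is no generator state to advance, adding a field upstream cannot shift anything downstream, which is exactly the failure the seeded-faker tier is exposed to.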

what i would not reach for here is an llm to generate the dataset. a 2026 study — “large language models are bad dice players” — found models lack a real internal sampler: under independent draws, 10 of 11 models matched zero target distributions, accuracy got worse as the sample grew, with systematic positional and demographic skew and collapsed variance. great for narrative variety and edge-case prose, wrong for honest statistics. for a mock that’s meant to behave like a population of accounts, a seeded prng plus a stateless mapper is the sweet spot. and because the whole dataset is synthetic and seeded, there is zero real pii in the repo, in ci logs, or in the model’s context window — which is a property of how it was built, not a magic word. “synthetic” is not a gdpr exemption on its own.

a green mock proves shape, not truth

the honest caveat is the part that keeps this credible, so it goes last and loudest.

a mock validates the shape of an interaction and your policy around it. it does not validate that the real backend behaves. the failure mode is mock drift, and it’s characteristically silent: the mock is correct the day you write it, then the real api renames a field, adds a required param, widens an enum, makes an optional field nullable — and the mock keeps the old shape. your tests stay green while production breaks. worse, schema-valid is not semantically correct: a response can pass openapi validation and still return the wrong thing — microcks’ canonical example is ordering a millefeuille and getting an éclair. a green test against a permissive mock is confidence about structure, not about truth.

and a mock encodes your assumptions about the dependency, not its behavior. elastic showed this nicely: a change from column indices to column names broke their mock-based tests while the real-integration test sailed through. the mock can’t tell you when your mental model of the dependency is the thing that’s wrong.

then there’s everything mocks routinely skip: auth and oauth token refresh, rate limiting, real pagination cursors, concurrency, cold-start latency. the whole discovery-and-auth dance an mcp client walks — the 401, the token refresh, the paginated cursor, the three-second cold start — your agent never sees any of it against an instant in-memory mock. that’s the integration cliff, and it’s exactly where most real tool-use failures live. an agent that has never seen a 401-then-refresh is untested on the 401-then-refresh. agents will even over-fit to mock-world quirks and reward-hack the harness if you let them — 2026 benchmark work caught agents reading gold answers through file:// urls instead of solving the task.

none of that is an argument against synthetic apis. it’s an argument for the right shape:

derive the mock from the same spec the backend publishes, so they can’t disagree by accident.
run consumer-driven contract tests (pact) so the contract is generated from the agent-as-consumer’s actual calls and verified against the provider — the mechanism that catches drift before deploy.
keep a thin layer of real-integration evals for grounding. they’re slow and you’ll have few of them; that’s fine. they’re the truth check, not the inner loop.
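the first item on that list can be guarded with a very small structural check in ci. a sketch only: it is not pact and not a full json schema validator, it handles flat objects with basic json types, and the schema fragment, field names, and findDrift helper are hypothetical. it catches shape drift; it says nothing about semantic correctness.

```javascript
// compare one mock payload against a flat, spec-like schema fragment
function findDrift(schema, payload) {
  const issues = [];
  for (const field of schema.required || []) {
    if (!(field in payload)) issues.push(`missing required field: ${field}`);
  }
  for (const [field, value] of Object.entries(payload)) {
    const spec = (schema.properties || {})[field];
    if (!spec) {
      issues.push(`field not in spec: ${field}`);
      continue;
    }
    const actual = value === null ? "null" : Array.isArray(value) ? "array" : typeof value;
    const allowed = [].concat(spec.type); // accepts "string" or ["string", "null"]
    if (!allowed.includes(actual)) issues.push(`type mismatch on ${field}: got ${actual}`);
  }
  return issues;
}

// the published spec renamed amount -> amount_minor; the mock kept the old shape
const publishedSchema = {
  required: ["id", "amount_minor"],
  properties: {
    id: { type: "string" },
    amount_minor: { type: "number" },
    memo: { type: ["string", "null"] },
  },
};
const staleMockPayload = { id: "t1", amount: -120, memo: null };
const drift = findDrift(publishedSchema, staleMockPayload);
```

here drift lists both the missing amount_minor and the stale amount field, so the rename fails a check instead of staying silently green.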

synthetic for velocity and scale in the inner loop, where you run thousands of deterministic trials. real for the few that confirm the inner loop wasn’t measuring a fiction.

the synthetic api isn’t training wheels you take off when the real backend ships. it’s the only surface on which the agent’s behavior is a function instead of a roll of the dice — which is precisely why it outlives the thing it was standing in for.