the middleware pattern, for agents
extensions are the middleware of the agent world. they wrap the base loop — the thing that takes a message, runs the model, maybe calls a skill, returns a response — and layer in everything that's around the core job: memory injection, retrieval, retries, rate limits, caching, telemetry, safety filters, token budgeting. if you've written express middleware or a rails interceptor, this is that, for agent turns.
the reason to call them a pattern, not just "some code you wrap things with": treating them as composable (ctx, next) ⇒ result functions is what lets you add and remove concerns without touching the loop, and lets you reason about each one independently.
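to make the (ctx, next) ⇒ result shape concrete, here is a minimal sketch. the names Ctx, Result, Loop, compose, baseLoop and withUppercase are made up for illustration and don't come from any particular framework, and everything is synchronous for brevity (a real loop would most likely return a promise).

```typescript
// hypothetical shapes for illustration; not a real library's api.
interface Ctx {
  message: string;
  retrieved?: string[];
}

interface Result {
  text: string;
}

// the innermost function: takes the context, returns a response.
type Loop = (ctx: Ctx) => Result;

// an extension gets the shared ctx plus a `next` that runs everything inside it.
type Extension = (ctx: Ctx, next: () => Result) => Result;

// wrap the base loop so the first extension in the list ends up outermost.
function compose(extensions: Extension[], base: Loop): Loop {
  return extensions.reduceRight<Loop>(
    (inner, ext) => (ctx: Ctx) => ext(ctx, () => inner(ctx)),
    base,
  );
}

// stand-in for "run the model": echoes the message and how much context it saw.
const baseLoop: Loop = (ctx) => ({
  text: `echo: ${ctx.message} (docs: ${ctx.retrieved?.length ?? 0})`,
});

// one concern: attach retrieved documents to the shared ctx on the way in.
const withRetrieval: Extension = (ctx, next) => {
  ctx.retrieved = ["doc-1", "doc-2"]; // placeholder for a real lookup
  return next();
};

// another concern: post-process the result on the way out.
const withUppercase: Extension = (_ctx, next) => {
  const result = next();
  return { text: result.text.toUpperCase() };
};

// adding or removing a concern is a change to this list, not to baseLoop.
const agent = compose([withUppercase, withRetrieval], baseLoop);
const bare = compose([], baseLoop);
```

calling agent({ message: "hi" }) runs both extensions around the loop; bare({ message: "hi" }) runs the loop alone. baseLoop is identical in both cases, which is the point of the pattern.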
composition — retrieval, telemetry, safety
three small rules keep extensions clean and debuggable:
- one concern per extension. retrieval is one thing, telemetry is another, safety is a third. an extension that does two things becomes the extension that always needs changing.
- mutate ctx, don't replace it. extensions share a ctx object. adding to it is fine (see how withRetrieval writes to ctx.retrieved); swapping it for a new object breaks the chain because downstream extensions already have references to the old one.
- call next exactly once. zero calls short-circuits the loop (sometimes you want this: a cache hit, a hard block). two calls runs the agent twice and produces duplicate telemetry. one is the contract.
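the last two rules can be sketched like this. withRetrieval and ctx.retrieved are the names used above; withCache, responseCache, guardNext and the demo at the bottom are hypothetical, and the guard is just one possible way to catch a double call during development.

```typescript
interface Ctx {
  message: string;
  retrieved?: string[];
}
interface Result {
  text: string;
}
type Next = () => Result;
type Extension = (ctx: Ctx, next: Next) => Result;

// mutate, don't replace: add a field to the shared ctx and pass control on.
const withRetrieval: Extension = (ctx, next) => {
  ctx.retrieved = [`note about ${ctx.message}`]; // placeholder lookup
  return next();
};

// zero calls on purpose: a cache hit returns without running the loop.
const responseCache = new Map<string, Result>();
const withCache: Extension = (ctx, next) => {
  const hit = responseCache.get(ctx.message);
  if (hit) return hit; // short-circuit
  const result = next();
  responseCache.set(ctx.message, result);
  return result;
};

// development-time guard (hypothetical helper): throws on a second next() call.
function guardNext(ext: Extension): Extension {
  return (ctx, next) => {
    let calls = 0;
    return ext(ctx, () => {
      calls += 1;
      if (calls > 1) throw new Error("next() called more than once");
      return next();
    });
  };
}

// demo: count how often the "model" actually runs.
let modelRuns = 0;
const base = (ctx: Ctx): Result => {
  modelRuns += 1;
  return { text: `answer using ${ctx.retrieved?.length ?? 0} docs` };
};

const sharedCtx: Ctx = { message: "refund policy" };
const runTurn = (): Result =>
  guardNext(withCache)(sharedCtx, () =>
    guardNext(withRetrieval)(sharedCtx, () => base(sharedCtx)),
  );

const firstTurn = runTurn(); // cache miss: retrieval and model both run
const secondTurn = runTurn(); // cache hit: nothing inside the cache layer runs
```

after both turns modelRuns is 1, because the second turn never reached the loop. base sees ctx.retrieved only because withRetrieval wrote to the same object rather than handing a copy downstream.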
order matters — the onion
outer extensions see everything — including what inner extensions did. telemetry usually goes outermost so it captures retries; safety often goes just inside retrieval so it can inspect what the model produced with context. there's no universal stack, just the one you document for your team.
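a small trace makes the onion visible. this is a sketch with placeholder layers that only record when they are entered and exited; layer, trace, model and stack are illustrative names, and the stack order is just the one described above, not a recommendation.

```typescript
interface Ctx {
  message: string;
}
interface Result {
  text: string;
}
type Loop = (ctx: Ctx) => Result;
type Extension = (ctx: Ctx, next: () => Result) => Result;

// first extension in the list ends up outermost.
const compose = (exts: Extension[], base: Loop): Loop =>
  exts.reduceRight<Loop>(
    (inner, ext) => (ctx: Ctx) => ext(ctx, () => inner(ctx)),
    base,
  );

// each placeholder layer records entry, runs everything inside it, records exit.
const trace: string[] = [];
const layer = (name: string): Extension => (_ctx, next) => {
  trace.push(`${name}:in`);
  const result = next();
  trace.push(`${name}:out`);
  return result;
};

// stand-in for the base loop.
const model: Loop = (ctx) => {
  trace.push("model");
  return { text: `reply to ${ctx.message}` };
};

const stack = compose(
  [layer("telemetry"), layer("retrieval"), layer("safety")],
  model,
);
stack({ message: "hello" });

// trace is now, in order:
// telemetry:in, retrieval:in, safety:in, model,
// safety:out, retrieval:out, telemetry:out
```

the outermost layer is the first one in and the last one out, so it observes everything the inner layers did. reordering the list passed to compose is the only change needed to try a different stack.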
when it breaks
- hidden state across extensions. extension A stashes something on ctx that extension C relies on. now the order of A and C is load-bearing and invisible. document the contract or use a typed ctx with explicit fields.
- retry extensions that don't know about idempotency. a naive retry layer will re-send the same tool call that already charged a card. either make the skills idempotent or make the retry layer aware of which skills are safe to replay.
- safety at the wrong layer. moderation run before the model doesn't catch the model's own output. moderation run after tool calls doesn't stop the tool call that already happened. pick the point in the turn that actually corresponds to what you're trying to prevent.
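one way the first two failure modes could be handled together is sketched below: an explicit, typed ctx field records which skills ran, and the retry layer consults an allow-list before replaying. everything here is an assumption made for illustration: the skillsCalled field, the REPLAY_SAFE set, the skill names and the two fake loops. note that a retry layer re-invokes next after a failure by design, which is exactly why it has to know what is safe to repeat.

```typescript
interface Ctx {
  message: string;
  // explicit contract: the loop appends the name of every skill it invokes.
  skillsCalled: string[];
}
interface Result {
  text: string;
}
type Extension = (ctx: Ctx, next: () => Result) => Result;

// assumption: the team maintains a list of skills that are safe to replay.
const REPLAY_SAFE = new Set(["search_docs", "get_weather"]);

function withRetry(maxAttempts: number): Extension {
  return (ctx, next) => {
    for (let attempt = 1; ; attempt++) {
      const calledBefore = ctx.skillsCalled.length;
      try {
        return next();
      } catch (err) {
        // only look at skills that ran during the failed attempt.
        const calledThisAttempt = ctx.skillsCalled.slice(calledBefore);
        const safe = calledThisAttempt.every((s) => REPLAY_SAFE.has(s));
        if (!safe || attempt >= maxAttempts) throw err;
      }
    }
  };
}

// fake loop 1: a read-only skill that fails once, then succeeds.
let searchRuns = 0;
const flakySearch = (ctx: Ctx): Result => {
  searchRuns += 1;
  ctx.skillsCalled.push("search_docs");
  if (searchRuns === 1) throw new Error("timeout");
  return { text: "found it" };
};

// fake loop 2: a side-effecting skill that fails after the side effect.
let chargeRuns = 0;
const flakyCharge = (ctx: Ctx): Result => {
  chargeRuns += 1;
  ctx.skillsCalled.push("charge_card");
  throw new Error("timeout after charge");
};

const retry = withRetry(3);

const searchCtx: Ctx = { message: "find the doc", skillsCalled: [] };
const searchResult = retry(searchCtx, () => flakySearch(searchCtx));

const chargeCtx: Ctx = { message: "buy it", skillsCalled: [] };
let chargeError: string | null = null;
try {
  retry(chargeCtx, () => flakyCharge(chargeCtx));
} catch (err) {
  chargeError = (err as Error).message;
}
```

the search turn is retried and succeeds on the second attempt; the charge turn fails once and the error is surfaced instead of replayed, so chargeRuns stays at 1. an allow-list is the conservative choice here: a skill nobody has classified is treated as unsafe to repeat.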
next: personas — the bundle of prompt, voice, allowed skills, and guardrails that makes one loop behave like many different agents.