scalable systems handle growth without breaking

skills — the smallest unit

a tool call is not a unit of reuse. a skill is. what 40 copy-pasted handlers taught me.

my first production agent crossed forty tool calls before anyone on the team said the word library. each one was a function with a json schema stapled to it, written in whatever style the author was in the mood for that day. some validated input, most didn’t. some returned { error }, some threw, one returned a string that started with "ERROR: ". the model coped, mostly, which is exactly the problem — the model coping is what lets the sprawl compound until cleanup costs a sprint.

the cleanup taught me the pattern. it now has a name in the agentic patterns course: a skill.

a tool call plus everything it needs

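a skill is the smallest reusable unit in an agent system. it’s a tool call plus everything around it that the tool call needs to be safe and composable: a description written for the model, input and output validation, a structured error shape, and a host-supplied context object.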

if you’ve written tool use with any sdk, you’ve written most of a skill. the pattern isn’t new machinery — it’s the decision to treat that bundle as a first-class object. defined once, tested without a model in the loop, handed to many agents. the agent picks a subset of skills it’s allowed to use; the skill has no idea which agent called it. that separation is the whole point.
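
a minimal sketch of that bundle as a typescript type, under assumptions: every name here (Skill, SkillResult, invoke, pickSkills, word_count) is made up for illustration and not taken from any sdk, and run is synchronous only to keep the example short.

```typescript
// all names are hypothetical; this is one possible shape, not an sdk api
type SkillError = { kind: "validation" | "upstream"; message: string; retriable: boolean };
type SkillResult<T> = { ok: true; value: T } | { ok: false; error: SkillError };

interface Skill<I, O, Ctx> {
  name: string;
  description: string; // the text the model reads when picking a tool
  validateInput(raw: unknown): SkillResult<I>; // checks model-supplied arguments
  run(input: I, ctx: Ctx): unknown; // sync here; a real skill would likely be async
  validateOutput(raw: unknown): SkillResult<O>; // checks whatever run() got back
}

// one entry point for every skill: validate in, run, validate out
function invoke<I, O, Ctx>(skill: Skill<I, O, Ctx>, rawArgs: unknown, ctx: Ctx): SkillResult<O> {
  const input = skill.validateInput(rawArgs);
  if (input.ok === false) return input;
  return skill.validateOutput(skill.run(input.value, ctx));
}

// a deliberately tiny skill so the whole bundle fits on screen
const wordCount: Skill<{ text: string }, number, {}> = {
  name: "word_count",
  description: "count whitespace-separated words in a short text. not for files or urls.",
  validateInput(raw) {
    const text =
      typeof raw === "object" && raw !== null ? (raw as Record<string, unknown>)["text"] : undefined;
    if (typeof text !== "string") {
      return { ok: false, error: { kind: "validation", message: "text must be a string", retriable: true } };
    }
    return { ok: true, value: { text } };
  },
  run(input) {
    return input.text.split(/\s+/).filter((w) => w.length > 0).length;
  },
  validateOutput(raw) {
    if (typeof raw !== "number") {
      return { ok: false, error: { kind: "upstream", message: "expected a number", retriable: false } };
    }
    return { ok: true, value: raw };
  },
};

// an agent is an allow-list over a shared registry; the skill never learns who called it
function pickSkills<S extends { name: string }>(registry: S[], allowed: string[]): S[] {
  return registry.filter((s) => allowed.indexOf(s.name) !== -1);
}
```

nothing in the skill references an agent, which is what lets the same object sit in several allow-lists.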

the description is a prompt

anthropic’s tool use docs say to write detailed descriptions, and everyone nods and writes one sentence. here’s the reframe that made me take it seriously: the description is read hundreds of times a day, by the model, in a hurry. it’s not documentation. it’s product copy with a conversion rate.

the failure mode is the generous description. every skill description that ends in “use this for most things” will be picked for most things. i had a search_records skill described as “search the database” sitting next to four narrower lookups — the model routed almost everything through search_records, because of course it did. the menu said general store next to four specialty shops. write narrow descriptions and let the model pick from a clear menu; routing quality is mostly a writing problem before it’s a model problem.

validation is the cheapest eval you’ll ever write

input validation catches the hallucinated argument before it hits your database. the model inventing a 9-hour calendar booking, a negative amount, an id from a different entity type — these aren’t rare events, they’re tuesday. a validator is a few lines and it converts silent corruption into structured feedback.

structured is the operative word. when validation fails, the skill should return a result the model can reason about (kind: "validation", a message naming the field, retriable: true), not a stack trace. the model is good at reading a small enum and retrying with fixed arguments. it’s bad at reading a traceback. the loop is already a decent error-handling harness if you feed it something legible.
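
a sketch of what that could look like for the calendar booking case. the field names (calendarId, startIso, durationMinutes) and the four-hour cap are assumptions chosen for the example, not a real api.

```typescript
// hypothetical booking arguments; field names and the 240-minute cap are assumptions
type BookingInput = { calendarId: string; startIso: string; durationMinutes: number };
type ValidationError = { kind: "validation"; field: string; message: string; retriable: true };
type Validated<T> = { ok: true; value: T } | { ok: false; error: ValidationError };

const MAX_DURATION_MINUTES = 240;

// every failure has the same small shape: which field, what is wrong, and that a retry may help
function fail(field: string, message: string): Validated<never> {
  return { ok: false, error: { kind: "validation", field, message, retriable: true } };
}

function validateBooking(raw: unknown): Validated<BookingInput> {
  if (typeof raw !== "object" || raw === null) return fail("(root)", "arguments must be an object");
  const r = raw as Record<string, unknown>;
  const calendarId = r["calendarId"];
  const startIso = r["startIso"];
  const durationMinutes = r["durationMinutes"];

  if (typeof calendarId !== "string" || calendarId === "") {
    return fail("calendarId", "calendarId must be a non-empty string");
  }
  if (typeof startIso !== "string" || isNaN(Date.parse(startIso))) {
    return fail("startIso", "startIso must be a parseable timestamp");
  }
  if (
    typeof durationMinutes !== "number" ||
    Math.floor(durationMinutes) !== durationMinutes ||
    durationMinutes <= 0 ||
    durationMinutes > MAX_DURATION_MINUTES
  ) {
    return fail("durationMinutes", "durationMinutes must be a whole number from 1 to " + MAX_DURATION_MINUTES);
  }
  return { ok: true, value: { calendarId, startIso, durationMinutes } };
}

// the hallucinated 9-hour booking comes back as data, not as an exception
const rejected = validateBooking({
  calendarId: "team",
  startIso: "2025-03-04T09:00:00Z",
  durationMinutes: 540,
});
```

here rejected.ok is false and rejected.error.field is "durationMinutes", which is enough for the model to retry with a shorter duration.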

output validation is the half everyone skips, and it’s the half that catches drift. when the downstream api widens an enum or makes a field nullable, the skill that “still works” because output checking was optional is a landmine with a long fuse. fail loud at the boundary. a green mock proves shape, not truth — the same logic applies one level down: a skill that doesn’t check its output is asserting nothing.
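
a sketch of an output check that fails loud on both kinds of drift mentioned above, a widened enum and a field gone nullable. the response shape, the status values, and the choice of retriable: false are all assumptions for the example.

```typescript
// hypothetical downstream response for a booking call; statuses and fields are assumptions
type BookingStatus = "confirmed" | "tentative";
type BookingResult = { bookingId: string; status: BookingStatus };
// retriable: false is an assumption: retrying the same call will not undo upstream drift
type OutputError = { kind: "output_validation"; field: string; message: string; retriable: false };
type Checked<T> = { ok: true; value: T } | { ok: false; error: OutputError };

const KNOWN_STATUSES: BookingStatus[] = ["confirmed", "tentative"];

function drift(field: string, message: string): Checked<never> {
  return { ok: false, error: { kind: "output_validation", field, message, retriable: false } };
}

function checkBookingOutput(raw: unknown): Checked<BookingResult> {
  if (typeof raw !== "object" || raw === null) return drift("(root)", "response must be an object");
  const r = raw as Record<string, unknown>;
  const bookingId = r["bookingId"];
  const status = r["status"];

  // catches a field that quietly became nullable
  if (typeof bookingId !== "string") {
    return drift("bookingId", "bookingId must be a string, got " + typeof bookingId);
  }
  // catches an enum the downstream api widened without telling anyone
  const known = KNOWN_STATUSES.filter((s) => s === status)[0];
  if (known === undefined) {
    return drift("status", "unexpected status " + JSON.stringify(status));
  }
  return { ok: true, value: { bookingId, status: known } };
}
```

the check is not optional in this sketch: the only way to get a BookingResult is through checkBookingOutput.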

the context object is the side-channel

the llm supplies the semantic arguments — which calendar, what time, how long. the host supplies the environment — auth tokens, db clients, the current user’s timezone, feature flags. those travel in a context object the model never sees.

mixing the two is how credentials end up in a prompt. if the model has to pass userId and the host trusts it, you’ve built confused-deputy-as-a-service: any successfully injected instruction can read someone else’s data with the agent’s permissions. the rule is mechanical — anything that identifies who or grants access comes from the host side, full stop. the model’s arguments are user input. treat them with exactly that much trust.
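
a sketch of that split, assuming a made-up get_calendar skill and an in-memory array standing in for the database. ModelArgs, HostContext, and the row shape are hypothetical names.

```typescript
// hypothetical names throughout; what matters is which side each value comes from
type ModelArgs = { calendarId: string }; // chosen by the model, treated as user input
type HostContext = { userId: string; timezone: string }; // set by the host from the session
type CalendarRow = { calendarId: string; ownerId: string; title: string };

// stand-in for a database table
const CALENDARS: CalendarRow[] = [
  { calendarId: "cal_1", ownerId: "user_a", title: "work" },
  { calendarId: "cal_2", ownerId: "user_b", title: "private" },
];

function getCalendar(args: ModelArgs, ctx: HostContext): CalendarRow | null {
  // ownership is checked against ctx.userId, so an injected calendarId cannot cross users
  const match = CALENDARS.filter(
    (c) => c.calendarId === args.calendarId && c.ownerId === ctx.userId
  )[0];
  return match === undefined ? null : match;
}

// the schema shown to the model lists only ModelArgs; there is no userId field to fill in
const modelFacingSchema = {
  type: "object",
  properties: { calendarId: { type: "string" } },
  required: ["calendarId"],
};

// user_a asking for user_b's calendar gets nothing, whatever the prompt said
const crossUser = getCalendar({ calendarId: "cal_2" }, { userId: "user_a", timezone: "Europe/Copenhagen" });
```

crossUser is null here: the model can name any calendar it likes, but it cannot name a different user.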

skills are leaves

the tempting move, around skill fifteen, is a skill that calls another skill. book_meeting calls list_availability internally, picks a slot, then calls book_slot. it works, and you’ve just built an implicit graph the model can’t see and you can’t trace. when the inner call fails, the error surfaces from a skill the model never invoked, and the retry logic guesses wrong.

push composition up. either the agent loop sequences the calls — that’s what it’s for — or a deliberate orchestration layer does, an extension wrapping the loop, visible and traceable. skills should be leaves: one capability, one boundary, no hidden fan-out.
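
a sketch of the second option, with in-memory stand-ins for the two leaf skills. the function names and the trace shape are assumptions; a real orchestration layer would also handle a failing book_slot.

```typescript
// hypothetical leaf skills with in-memory stand-ins; neither one calls the other
type Slot = { startIso: string };
type Step = { skill: string; ok: boolean };

function listAvailability(): Slot[] {
  return [{ startIso: "2025-03-04T09:00:00Z" }, { startIso: "2025-03-04T13:00:00Z" }];
}

function bookSlot(slot: Slot): { bookingId: string } {
  return { bookingId: "bk_" + slot.startIso };
}

// composition lives above the skills and records each step,
// so a failure is attributed to the skill that actually failed
function bookFirstFreeSlot(trace: Step[]): { bookingId: string } | null {
  const first = listAvailability()[0];
  trace.push({ skill: "list_availability", ok: first !== undefined });
  if (first === undefined) return null;

  const booking = bookSlot(first);
  trace.push({ skill: "book_slot", ok: true });
  return booking;
}

const steps: Step[] = [];
const booked = bookFirstFreeSlot(steps);
```

after this runs, steps holds one entry per skill call in order, which is the trace the nested version never gave you.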

a note on the name

anthropic now ships something called agent skills — folders of instructions and scripts the model loads on demand. different mechanism than the pattern here (theirs is packaged knowledge, this is a packaged capability), but the instinct is identical: don’t hand the model a bare function, hand it the capability bundled with everything needed to use it correctly. the convergence is the tell that the underlying unit is right. mcp tools, sdk tool use, skill folders — wire formats vary; the host-side discipline is the same.

the payoff is the ratio

you will have 30–80 skills long before you have 30 agents. that ratio is why the pattern earns its keep: a skill is a plain function, so it unit-tests without a model in the loop — validators, error shapes, happy path, all deterministic, all fast. the only part that needs an eval is the description, because the description is the only part the model interprets. that’s a clean split of your testing budget, and it falls out of the pattern for free.
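
a sketch of what such a test can look like, using the negative amount case from earlier. the refund validator and its rule (a positive whole number of cents) are assumptions for the example; no test framework is used, just a table and a throw.

```typescript
// hypothetical refund validator; the rule (positive whole number of cents) is an assumption
type Outcome = { ok: true } | { ok: false; kind: "validation"; field: string };

function validateRefund(args: { amountCents: number }): Outcome {
  const a = args.amountCents;
  if (!isFinite(a) || a <= 0 || Math.floor(a) !== a) {
    return { ok: false, kind: "validation", field: "amountCents" };
  }
  return { ok: true };
}

// table-driven cases: deterministic, no model, no network
const cases: { name: string; amountCents: number; expectOk: boolean }[] = [
  { name: "happy path", amountCents: 1250, expectOk: true },
  { name: "negative amount", amountCents: -500, expectOk: false },
  { name: "fractional cents", amountCents: 10.5, expectOk: false },
  { name: "not a number", amountCents: NaN, expectOk: false },
];

// collect the names of cases whose outcome differs from the expectation
const failures = cases
  .filter((c) => validateRefund({ amountCents: c.amountCents }).ok !== c.expectOk)
  .map((c) => c.name);

if (failures.length > 0) {
  throw new Error("failing cases: " + failures.join(", "));
}
```

the description gets no row in this table, which matches the split above: it needs an eval, not a unit test.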

the runnable version of this argument — define a skill, watch validation catch a bad booking, share one skill across two agents with different guardrails — is lesson 2 of agentic patterns. fourteen minutes, code panes included.

scalable labs·cvr 30091604·github·linkedin·hello@scalable.dk