prompt caching costs less than you think

prompt caching is the cheapest infrastructure win in the anthropic api right now. i didn’t believe it until i ran the numbers on a real workload.

the workload

an internal coding agent with a 28k-token system prompt — tools, conventions, examples — and a per-turn user message of roughly 600 tokens. before caching: every turn paid for 28k input tokens at full rate. with caching: the first turn writes the cache (10% premium), every following turn within five minutes reads at 10% of the input rate.

the math

across a typical 6-turn session:

without caching: 6 × 28,600 = 171,600 input tokens billed at full rate.
with caching: 1 × 28,600 (write, +10%) + 5 × 28,600 read at 10%, plus 6 × 600 user tokens. roughly 31k token-equivalents. an 80% reduction on the input bill.

the cache ttl is 5 minutes. that ruins it for anything where users disappear and come back. but for a chat session where someone’s actively iterating, it’s almost free.

the gotchas

the cache breakpoint has to land on a stable prefix. if you append a timestamp at the top of every prompt, you’ve cached nothing.
system prompt + tools + few-shot examples should all be above the breakpoint. user input below.
monitor the cache hit metric in the response. if it’s low, your prefix isn’t stable — find out why.

the cost of adding caching to an existing claude api integration is roughly twenty lines of code (cache_control: { type: "ephemeral" } on the relevant content blocks). the cost of not adding it is about 70-90% of your input bill, depending on workload shape.

run the numbers on yours. mine surprised me.