prompt caching costs less than you think
prompt caching is the cheapest infrastructure win in the anthropic api right now. i didn’t believe it until i ran the numbers on a real workload.
the workload
an internal coding agent with a 28k-token system prompt — tools, conventions, examples — and a per-turn user message of roughly 600 tokens. before caching: every turn paid for 28k input tokens at full rate. with caching: the first turn writes the cache (10% premium), every following turn within five minutes reads at 10% of the input rate.
the math
across a typical 6-turn session:
- without caching: 6 × 28,600 = 171,600 input tokens billed at full rate.
- with caching: 1 × 28,600 (write, +10%) + 5 × 28,600 read at 10%, plus 6 × 600 user tokens. roughly 31k token-equivalents. an 80% reduction on the input bill.
the cache ttl is 5 minutes. that ruins it for anything where users disappear and come back. but for a chat session where someone’s actively iterating, it’s almost free.
the gotchas
- the cache breakpoint has to land on a stable prefix. if you append a timestamp at the top of every prompt, you’ve cached nothing.
- system prompt + tools + few-shot examples should all be above the breakpoint. user input below.
- monitor the cache hit metric in the response. if it’s low, your prefix isn’t stable — find out why.
the cost of adding caching to an existing claude api integration is roughly twenty lines of code (cache_control: { type: "ephemeral" } on the relevant content blocks). the cost of not adding it is about 70-90% of your input bill, depending on workload shape.
run the numbers on yours. mine surprised me.