// ultra-niche builds
by Josh · April 21, 2026 · 5 min read

Anthropic Prompt Caching: Real Production Numbers After 30 Days

Anthropic's prompt caching claims up to 90% cost reduction. I instrumented every call across two production services and tracked actual savings. Here's what landed.


Anthropic's prompt caching launched in 2024 with a claim of up to 90% cost reduction on cached portions. I turned it on across two production services 30 days ago and instrumented every call.

This is what actually landed.

The services

Service A: customer support agent. ~80,000 calls per month. Long system prompt with knowledge base, brand voice samples, refusal rules. Total system prompt: ~12,000 tokens. Each user turn adds a small amount of dynamic context.

Service B: code review bot. ~22,000 calls per month. System prompt + project-specific guidelines per repo (each repo's guidelines are stable). Total: ~8,000 tokens of prompt context per repo.

Both ran on Claude Sonnet 4.6.

How prompt caching works (briefly)

You mark portions of your prompt as cacheable. The first call writes those tokens to cache (at 1.25x normal input cost). Subsequent calls within 5 minutes hit the cache (at 0.1x normal input cost).

Cache lifetime defaults to 5 minutes. There's also a 1-hour cache tier that costs more to write (2x base input, vs 1.25x for the default tier) but lasts longer.

You don't cache the user's specific message. You cache the stable scaffolding around it.
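Mechanically, it's opt-in per content block via the Messages API. A minimal sketch with the Python SDK (the model string and placeholder text here are mine, not the production setup):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SCAFFOLDING = "system prompt + knowledge base + brand voice samples..."

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use the model you actually run
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SCAFFOLDING,
            # Everything up to and including this block becomes the cached prefix
            # (~5-minute TTL by default). First call writes it; later calls read it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "the dynamic per-turn message"}],
)
```

Caching is prefix-based: the cache key is the exact sequence of blocks up to the breakpoint, so any edit to the scaffolding means a fresh write.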

The numbers — Service A

Before caching:

- 80,000 calls × ~12,500 avg input tokens × $3/M = ~$3,000/mo on input
- Plus ~$450/mo on output
- Total: ~$3,450/mo

After caching (system prompt + knowledge base marked cacheable, ~11,000 of 12,500 tokens):

- Cache writes: ~30 writes/day = ~900/month. 900 × 11,000 × $3.75/M = ~$37
- Cache hits: ~79,100/month. 79,100 × 11,000 × $0.30/M = ~$261
- Non-cacheable portion: 80,000 × 1,500 × $3/M = ~$360
- Output unchanged: ~$450
- Total: ~$1,108/mo

Savings: $2,342/mo, or 68%.

Lower than the 90% headline. Higher than "modest." For a customer-support agent at this scale, $28k/year back.
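If you want to sanity-check these numbers against your own workload, the arithmetic fits in a few lines. A rough model (the function and its defaults are mine; prices are USD per million tokens at Sonnet-class input rates):

```python
def monthly_input_cost(calls, cached_tokens, dynamic_tokens, writes,
                       base=3.00, write_mult=1.25, read_mult=0.10):
    """Rough monthly input spend (USD) for a prompt-cached workload."""
    write_cost = writes * cached_tokens * base * write_mult / 1e6
    read_cost = (calls - writes) * cached_tokens * base * read_mult / 1e6
    dynamic_cost = calls * dynamic_tokens * base / 1e6
    return write_cost + read_cost + dynamic_cost

# Service A: ~$658 on input; add the unchanged ~$450 of output for the ~$1,108 total.
print(round(monthly_input_cost(calls=80_000, cached_tokens=11_000,
                               dynamic_tokens=1_500, writes=900)))
```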

The numbers — Service B

Before caching:

- 22,000 calls × ~8,500 avg input × $3/M = ~$561/mo on input
- Plus ~$110/mo on output
- Total: ~$671/mo

After caching (per-repo guidelines marked cacheable):

- Cache writes: maybe 200 writes (we only have 30 repos, but cache expires between bursts of activity)
- Cache hits: most of the rest
- Non-cacheable: tiny user message
- Total: ~$285/mo

Savings: $386/mo, or 58%.

Smaller absolute number. Roughly similar percentage.

Why not 90%

The 90% headline applies to the cacheable portion of the prompt, on cache hits. To get there in aggregate, you need:

- Almost all of your prompt to be cacheable (we had 88% in Service A, 80% in Service B)
- Almost all of your calls to be cache hits (we got 99% in Service A, 89% in Service B)
- A workload where output tokens are small relative to input (we had 6% output by tokens in Service A)

If any of those three slip, your aggregate savings drop fast.
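How fast is easy to see if you reduce the cost model to fractions (input spend only; the function is mine):

```python
def input_savings(cacheable_frac, hit_rate, write_mult=1.25, read_mult=0.10):
    """Fraction of input spend saved, relative to no caching at all."""
    cached = cacheable_frac * (hit_rate * read_mult + (1 - hit_rate) * write_mult)
    return 1 - cached - (1 - cacheable_frac)

print(f"{input_savings(0.88, 0.99):.0%}")  # Service A: ~78% saved on input
print(f"{input_savings(0.50, 0.99):.0%}")  # halve the cacheable share: ~44%
```

Unchanged output spend then dilutes Service A's 78% input figure down to the 68% aggregate reported above.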

The cases where caching loses

Bursty traffic with gaps over 5 minutes. If your traffic pattern is "10 calls every 20 minutes," each burst writes a new cache. You pay write cost without enough hit cost to amortize.

For these workloads, use the 1-hour cache tier. It costs more to write but covers gappy patterns.
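Requesting the longer tier is a one-field change on the cached block from the earlier sketch; the ttl value is the documented knob, though depending on your SDK version it may still sit behind a beta header:

```python
system_blocks = [
    {
        "type": "text",
        "text": STABLE_SCAFFOLDING,
        # 1-hour TTL: ~2x write cost, but it survives 20-minute gaps between bursts.
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }
]
```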

Highly dynamic prompts. If most of your prompt changes per call (different system prompt per user, different knowledge base per query), you can't cache much.

Very small prompts. If your system prompt is 500 tokens, caching helps less. The savings on hits barely pay back the write overhead.

What broke

Cache invalidation. When we updated a system prompt, all existing caches became stale. The first hour after a deploy had higher costs because every call was writing a new cache. Now we deploy prompt changes during low-traffic windows.

Token counting confusion. Anthropic's billing reports break out cached vs non-cached tokens, but it took us a few days to learn to read the dashboards correctly.
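The API response itself is unambiguous, though: every Messages response carries a per-call usage breakdown you can log yourself (continuing from the earlier sketch):

```python
u = response.usage
print({
    "uncached_input": u.input_tokens,
    "cache_write": u.cache_creation_input_tokens,  # billed at 1.25x
    "cache_read": u.cache_read_input_tokens,       # billed at 0.1x
    "output": u.output_tokens,
})
```

A call with non-zero cache_read_input_tokens is a hit; aggregating that per service per day is where the hit rates above come from.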

Per-repo cache management for Service B. We initially treated each repo's prompt as a separate cache, but with 30+ repos and bursty per-repo traffic, hit rates were lower than expected. We restructured to a shared cacheable system prompt + per-repo dynamic context. Hit rate jumped to 89%.
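The restructured call, roughly (names are mine; the point is that the cached block is now identical across repos, and per-repo material rides after the breakpoint at full input price):

```python
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SHARED_REVIEW_PROMPT,  # one prompt, one cache, all repos
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{
        "role": "user",
        # Per-repo guidelines moved out of the cached prefix, in with the diff.
        "content": f"{repo_guidelines}\n\n{diff_to_review}",
    }],
)
```

The tradeoff: per-repo guidelines are now billed at full input price on every call, but with 30+ repos of bursty traffic, one warm shared cache beat 30 cold ones.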

What's worth knowing

Cache management requires thought, not just a flag. You can't just turn it on and walk away. The cache structure has to match your traffic pattern.

The 1-hour tier is underused. For bursty workloads, the longer cache pays back if you're at all consistent.

Output tokens are not cached. All these savings apply only to input. If you're output-heavy, the percentage savings are smaller.

Cache analytics matter. Track hit rate per service per day. If it drops, you've changed something.

What I'd tell a new user

Turn it on. Mark the stable portions (system prompt, knowledge base, examples) as cacheable. Track hit rate weekly.

Expect 40-70% input cost savings on real workloads. Closer to the high end if your traffic is steady and your prompts are stable.

Plan for the cache to invalidate on prompt updates and deploys. Time those carefully.

This is one of the highest-ROI configuration changes you can make on Anthropic. Most teams I see haven't turned it on. Most teams I see should.

anthropic · prompt caching · cost · claude · long-tail