Prompt caching is the cheapest LLM cost optimization in 2026. Done right, your bill drops 60–90% on workloads with repeated prefixes. Done wrong, you don’t realize you’re paying full price. This post is the deep dive.
How it works
Each provider keeps the KV-cache state of stable prefixes. When a new request arrives with the same prefix, the prefix’s compute is skipped. Pricing reflects the savings.
| Provider | Cache mechanism | Hit price | Miss price | TTL |
|---|---|---|---|---|
| Anthropic | cache_control markers | 10% of input | 125% of input (write) | 5 min default |
| OpenAI | Automatic on >1024-token prefix | 50% of input | 100% | ~5–10 min |
| Google Gemini | Explicit cachedContent | varies | varies | configurable |
For mechanics see Anthropic Claude API + Tool Use Guide.
Where to place breakpoints (Anthropic)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 5k tokens, stable
            "cache_control": {"type": "ephemeral"},
        },
    ],
    tools=[
        # tool defs here, also stable
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": REFERENCE_DOC, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": question},  # dynamic; not cached
        ]},
    ],
)
Up to 4 cache breakpoints. Place them at:
- End of stable system prompt.
- End of tool definitions.
- End of static reference content (if any).
- End of conversation prefix (if multi-turn).
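Each breakpoint caches the whole prefix up to that point. Splitting the prefix across several breakpoints lets an earlier segment still hit when a later one changes.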
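It is easy to go over the breakpoint limit once several helpers each add their own marker. Here is a minimal sketch of a pre-flight check, assuming the request is assembled as a plain dict before being passed to the SDK. The function names and the `MAX_BREAKPOINTS` constant are mine, not part of the Anthropic SDK.

```python
from typing import Any

MAX_BREAKPOINTS = 4  # limit stated in this post; confirm against current docs


def count_breakpoints(request: dict[str, Any]) -> int:
    """Count blocks that carry a cache_control key in tools, system, and messages."""
    count = 0
    for tool in request.get("tools", []):
        count += int("cache_control" in tool)
    system = request.get("system", [])
    if isinstance(system, list):  # a plain-string system prompt cannot carry markers
        for block in system:
            count += int("cache_control" in block)
    for message in request.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):  # string content cannot carry markers either
            for block in content:
                count += int("cache_control" in block)
    return count


def check_breakpoints(request: dict[str, Any]) -> int:
    """Raise before sending if the request has more markers than allowed."""
    n = count_breakpoints(request)
    if n > MAX_BREAKPOINTS:
        raise ValueError(f"{n} cache breakpoints, limit is {MAX_BREAKPOINTS}")
    return n
```

Calling `check_breakpoints(request)` right before `client.messages.create(**request)` turns a confusing API error into a local one.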
Order matters
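Caching is prefix-based. If the order of your prompt segments differs from request to request, cache hits drop. Keep a stable order, most static first:
system → tools → memory → conversation prefix → current message
Within each segment, use deterministic ordering (e.g., tools sorted by name). Note that Anthropic assembles the cacheable prefix as tools, then system, then messages, so a change to a tool definition also invalidates the cached system prompt. See Context Engineering for LLMs.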
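One way to catch ordering drift early is to fingerprint the stable part of the prompt and log it with every request. The sketch below is an illustration only: `canonical_tools` and `prefix_fingerprint` are hypothetical helpers, and the hash covers a canonical JSON form, not the exact bytes the provider sees.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Return tools sorted by name so registration order cannot change the prefix."""
    return sorted(tools, key=lambda tool: tool["name"])


def prefix_fingerprint(system_text: str, tools: list[dict]) -> str:
    """Short, stable hash of the parts of the prompt that should never change."""
    payload = json.dumps(
        {"system": system_text, "tools": canonical_tools(tools)},
        sort_keys=True,           # key order inside each tool dict is normalized too
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

If the fingerprint changes between two deploys and nobody meant to edit the prompt, the cache was just invalidated.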
TTL and warming
Anthropic’s ephemeral cache lasts ~5 minutes. After that, the next request pays the write fee (125%) again to recreate.
In high-traffic apps, requests arrive fast enough that the cache stays warm, because each hit refreshes the TTL. In low-traffic apps, most requests arrive after the cache has expired and pay the write fee.
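How low does traffic have to be before caching costs more than it saves? A quick sketch of the break-even, assuming the Anthropic multipliers from the table above (10% on a hit, 125% on a write) and assuming every miss is a write:

```python
HIT_MULTIPLIER = 0.10    # cache read price as a fraction of base input price
WRITE_MULTIPLIER = 1.25  # cache write price as a fraction of base input price


def expected_multiplier(hit_rate: float) -> float:
    """Expected price of the cached prefix relative to sending it uncached."""
    return hit_rate * HIT_MULTIPLIER + (1.0 - hit_rate) * WRITE_MULTIPLIER


def break_even_hit_rate() -> float:
    """Hit rate at which caching costs exactly the same as not caching."""
    return (WRITE_MULTIPLIER - 1.0) / (WRITE_MULTIPLIER - HIT_MULTIPLIER)


print(f"break-even hit rate: {break_even_hit_rate():.1%}")
print(f"prefix cost at 80% hits: {expected_multiplier(0.8):.2f}x")
```

Under these assumptions the break-even sits a little under 22%: below that hit rate, the write surcharge outweighs the read discount.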
Mitigations:
- Synthetic warmer: a low-frequency request that touches the cache to keep it warm.
- 1-hour cache (Anthropic introduced; check current pricing): higher write fee but longer TTL.
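A synthetic warmer only needs to fire when real traffic has gone quiet. Below is a minimal sketch of that scheduling logic. `CacheWarmer`, the 30-second margin, and the `send_warm_request` callback are all hypothetical; the callback would be whatever sends your stable prefix with a tiny `max_tokens`.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 300.0     # ~5 minutes, per the post
SAFETY_MARGIN_SECONDS = 30.0  # hypothetical margin; tune for your latency


class CacheWarmer:
    """Fires a warm request only when the cache is close to expiring."""

    def __init__(
        self,
        send_warm_request: Callable[[], None],
        ttl: float = CACHE_TTL_SECONDS,
        margin: float = SAFETY_MARGIN_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send_warm_request
        self._ttl = ttl
        self._margin = margin
        self._clock = clock
        self._last_touch: Optional[float] = None

    def record_traffic(self) -> None:
        """Call after every real request that read or wrote the cache."""
        self._last_touch = self._clock()

    def tick(self) -> bool:
        """Call periodically. Returns True if a warm request was sent."""
        now = self._clock()
        stale = self._last_touch is None or now - self._last_touch >= self._ttl - self._margin
        if stale:
            self._send()
            self._last_touch = now
        return stale
```

Run `tick()` from a scheduler every 30 seconds or so. Each warm request still costs a cache read plus a little output, so compare that against the write fees it avoids.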
Measuring hit rate
resp = client.messages.create(...)
print(resp.usage)
# Usage(input_tokens=200, cache_creation_input_tokens=0, cache_read_input_tokens=4800, ...)
The ratio cache_read / (cache_read + input) is your hit rate. Aim for >80% on cacheable prefixes.
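The ratio is easy to compute across a batch of responses. This sketch assumes each usage object exposes the three token fields shown in the printed output above; `SimpleNamespace` stands in for the SDK's usage object so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable


def cache_hit_rate(usages: Iterable) -> float:
    """cache_read / (cache_read + input), summed over all given usages."""
    read = 0
    uncached = 0
    for usage in usages:
        read += getattr(usage, "cache_read_input_tokens", 0) or 0
        uncached += getattr(usage, "input_tokens", 0) or 0
    total = read + uncached
    return read / total if total else 0.0


def cache_write_tokens(usages: Iterable) -> int:
    """Tokens billed at the write price; a high number means the cache keeps expiring."""
    return sum(getattr(u, "cache_creation_input_tokens", 0) or 0 for u in usages)


example = SimpleNamespace(
    input_tokens=200,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=4800,
)
print(f"{cache_hit_rate([example]):.0%}")
```

Watch the write tokens as well: a request that recreates the cache shows up there, not in `input_tokens`.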
For monitoring see LLM Observability.
OpenAI specifics
OpenAI auto-caches prefixes >1024 tokens. No markers needed. Match the same prefix → cache hits. Pricing: 50% of input on hits.
For OpenAI, the engineering work is to avoid breaking the prefix:
- Order system messages identically.
- Don’t insert per-request data into the cached prefix.
- For RAG, retrieved chunks change per query — they go AFTER the stable prefix.
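In practice the three rules above come down to how you assemble the message list. A minimal sketch follows; `STABLE_SYSTEM_PROMPT` and `build_messages` are placeholder names, and the prompt text is filler.

```python
# Placeholder for a long, unchanging system prompt (must exceed the caching threshold).
STABLE_SYSTEM_PROMPT = "You are a support assistant. " + "Follow the policy. " * 200


def build_messages(question: str, retrieved_chunks: list[str]) -> list[dict]:
    """Stable system message first; per-query RAG context and question last."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},  # identical on every call
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

To confirm hits, check the cached-token count in the response usage. At the time of writing, Chat Completions reports it under `usage.prompt_tokens_details.cached_tokens`, but verify that against the current API reference.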
Common mistakes
1. Per-request data inside the cache
system = f"""You are an assistant. Today is {today_date}.
Long stable instructions..."""
today_date invalidates the cache every day. Move dynamic data out of the cached prefix.
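One way to fix it is to keep the system block byte-for-byte identical and carry the date in the dynamic user turn. A minimal sketch, where `build_request_parts` is a hypothetical helper and the instruction text is the same placeholder as above:

```python
from datetime import date

STABLE_INSTRUCTIONS = "You are an assistant.\nLong stable instructions..."


def build_request_parts(question: str, today: date) -> dict:
    """Return system and messages with the date outside the cached prefix."""
    system = [{
        "type": "text",
        "text": STABLE_INSTRUCTIONS,             # never interpolated
        "cache_control": {"type": "ephemeral"},
    }]
    messages = [{
        "role": "user",
        "content": [
            # Dynamic data lives after the last breakpoint.
            {"type": "text", "text": f"Today is {today.isoformat()}.\n\n{question}"},
        ],
    }]
    return {"system": system, "messages": messages}
```

The returned dict can be splatted into `client.messages.create(model=..., max_tokens=..., **parts)`.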
2. Breakpoint after the dynamic message
messages=[
    {"role": "user", "content": [
        {"type": "text", "text": question},  # dynamic
        {"type": "text", "text": REF_DOC, "cache_control": {"type": "ephemeral"}},  # too late
    ]}
]
Cache control marks “cache UP TO HERE.” If the question is before the marker, the cache is invalidated by every new question. REF_DOC needs to be FIRST.
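Here is a sketch of the corrected ordering, plus a tiny check that shows why it matters. Both helpers are illustrative names, not SDK functions: `cached_prefix` returns the blocks up to and including the last marker, which is the part that has to match between requests.

```python
def build_user_content(ref_doc: str, question: str) -> list[dict]:
    """Static reference first (with the marker), dynamic question after it."""
    return [
        {"type": "text", "text": ref_doc, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": question},
    ]


def cached_prefix(content: list[dict]) -> list[dict]:
    """Blocks up to and including the last cache_control marker."""
    last = max(
        (i for i, block in enumerate(content) if "cache_control" in block),
        default=-1,
    )
    return content[: last + 1]


good_a = cached_prefix(build_user_content("REF_DOC", "first question"))
good_b = cached_prefix(build_user_content("REF_DOC", "second question"))
print(good_a == good_b)  # the cached part is identical across questions
```

With the question placed first, `cached_prefix` would include it, and the two prefixes would differ on every request.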
3. Different system prompt per call
Tweaking the system prompt per request defeats caching. Pin it.
4. No cache hit metric
You assume you’re caching but it’s silently broken. Always check cache_read_input_tokens.
5. Not caching tool definitions
Tools are usually stable. Mark them.
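Marking tools takes a single marker on the last definition, on the assumption (worth confirming in the current docs) that a breakpoint on the final tool covers every tool before it. `with_cached_tools` is a hypothetical helper name:

```python
import copy


def with_cached_tools(tools: list[dict]) -> list[dict]:
    """Return a sorted copy of tools with a cache breakpoint on the last one."""
    if not tools:
        return []
    ordered = sorted(copy.deepcopy(tools), key=lambda tool: tool["name"])
    ordered[-1]["cache_control"] = {"type": "ephemeral"}
    return ordered
```

Sorting by name keeps the prefix stable even if tools are registered in a different order between deploys, and the deep copy leaves the caller's definitions untouched.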
Realistic savings
For a chatbot with:
- 5k-token system prompt.
- 1k-token tool definitions.
- ~500-token conversation prefix.
- ~100 dynamic tokens per query.
Without caching: ~6.6k input tokens × $3/MTok = $0.020/call.
With caching (assuming 80% hit rate after warmup):
- 80% of calls: ~6.5k cache hit + 100 dynamic = $0.0023/call.
- 20% miss: $0.025/call.
- Average: ~$0.007/call.
~65% cost reduction. At 1M calls/month: $20k → $7k. Real money.
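The arithmetic above is worth keeping as a script so you can plug in your own numbers. This sketch uses the same assumptions as the example: $3/MTok base input price, 10% on a hit, 125% on a write, and every miss counted as a write.

```python
BASE_PRICE_PER_MTOK = 3.00
HIT_MULT = 0.10
WRITE_MULT = 1.25

PREFIX_TOKENS = 5_000 + 1_000 + 500  # system + tools + conversation prefix
DYNAMIC_TOKENS = 100


def cost(tokens: int, multiplier: float = 1.0) -> float:
    """Dollar cost of `tokens` input tokens at the given price multiplier."""
    return tokens * BASE_PRICE_PER_MTOK * multiplier / 1_000_000


def blended_cost(hit_rate: float) -> float:
    """Average cost per call for a given cache hit rate."""
    hit = cost(PREFIX_TOKENS, HIT_MULT) + cost(DYNAMIC_TOKENS)
    miss = cost(PREFIX_TOKENS, WRITE_MULT) + cost(DYNAMIC_TOKENS)
    return hit_rate * hit + (1.0 - hit_rate) * miss


uncached = cost(PREFIX_TOKENS + DYNAMIC_TOKENS)
average = blended_cost(0.8)
print(f"uncached: ${uncached:.4f}  cached avg: ${average:.4f}")
print(f"savings: {1 - average / uncached:.0%}")
```

Unrounded, that is about $0.0198 per call uncached against $0.0067 with caching, a reduction of roughly 66%, in line with the rounded figures above.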
What I’d ship today
- Audit prompt structure: identify what’s stable, what’s dynamic.
- Place breakpoints at stable boundaries.
- Pin tool definitions and system prompt in stable order.
- Track cache_read_input_tokens in observability.
- Alert on cache miss rate above expected.
A weekend’s work. Permanent savings.
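For the tracking and alerting items, a rolling-window monitor is enough to start with. This is a sketch rather than a production metric pipeline: `HitRateMonitor`, the window size, and the 80% threshold are illustrative choices, and the hit rate uses the same ratio as the measuring section.

```python
from collections import deque
from typing import Optional


class HitRateMonitor:
    """Rolling cache hit rate over the last `window` requests."""

    def __init__(self, window: int = 100, min_hit_rate: float = 0.8) -> None:
        self._reads: deque = deque(maxlen=window)
        self._totals: deque = deque(maxlen=window)
        self._min_hit_rate = min_hit_rate

    def record(self, cache_read_tokens: int, input_tokens: int) -> None:
        """Feed in the two usage fields from each response."""
        self._reads.append(cache_read_tokens)
        self._totals.append(cache_read_tokens + input_tokens)

    def hit_rate(self) -> Optional[float]:
        """Token-weighted hit rate, or None before any traffic."""
        total = sum(self._totals)
        return sum(self._reads) / total if total else None

    def should_alert(self) -> bool:
        """True when the rolling hit rate is below the threshold."""
        rate = self.hit_rate()
        return rate is not None and rate < self._min_hit_rate
```

Wire `should_alert()` into whatever pages you already use. A silent drop usually means someone edited the prefix.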
Read this next
- Anthropic Claude API + Tool Use Guide
- LLM Cost Optimization in 2026
- Context Engineering for LLMs
- LLM Routing in 2026 — Use Haiku to Save 80%
If you want my prompt-cache audit script + observability dashboard, it’s at rajpoot.dev.