Prompt caching is the cheapest LLM cost optimization in 2026. Done right, your bill drops 60–90% on workloads with repeated prefixes. Done wrong, you pay full price without noticing. This post is the deep dive.

How it works

Each provider keeps the KV-cache state of stable prefixes. When a new request arrives with the same prefix, the prefix’s compute is skipped. Pricing reflects the savings.

| Provider | Cache mechanism | Hit price | Miss price | TTL |
| --- | --- | --- | --- | --- |
| Anthropic | cache_control markers | 10% of input | 125% of input (write) | 5 min default |
| OpenAI | Automatic on >1024-token prefix | 50% of input | 100% | ~5–10 min |
| Google Gemini | Explicit cachedContent | varies | varies | configurable |

For mechanics, see Anthropic Claude API + Tool Use Guide.

Where to place breakpoints (Anthropic)

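import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LARGE_SYSTEM_PROMPT, REFERENCE_DOC, and question are placeholders defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,                               # required by the Messages API
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,           # 5k tokens, stable
            "cache_control": {"type": "ephemeral"},
        },
    ],
    tools=[
        # tool defs here, also stable
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": REFERENCE_DOC, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": question},     # dynamic; not cached
        ]},
    ],
)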

Up to 4 cache breakpoints. Place them at:

  1. End of stable system prompt.
  2. End of tool definitions.
  3. End of static reference content (if any).
  4. End of conversation prefix (if multi-turn).

Each breakpoint caches the whole prefix up to that point. With several breakpoints, a change in a later segment still leaves the earlier segments cached.
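A quick way to stay inside the four-breakpoint limit is to count the markers before sending. This is a minimal sketch, assuming the request is held as a plain dict with the same tools/system/messages shape as the example above; count_breakpoints is a hypothetical helper, not part of any SDK.

```python
from typing import Any

MAX_BREAKPOINTS = 4  # limit stated above for Anthropic


def count_breakpoints(request: dict[str, Any]) -> int:
    """Count cache_control markers across tools, system blocks, and message content blocks."""
    blocks: list[Any] = list(request.get("tools", []))
    system = request.get("system", [])
    if isinstance(system, list):  # a plain-string system prompt carries no marker
        blocks.extend(system)
    for message in request.get("messages", []):
        content = message.get("content", [])
        if isinstance(content, list):  # plain-string content carries no marker
            blocks.extend(content)
    return sum(1 for block in blocks if isinstance(block, dict) and "cache_control" in block)


example = {
    "system": [{"type": "text", "text": "stable prompt", "cache_control": {"type": "ephemeral"}}],
    "tools": [{"name": "search", "description": "Search docs", "input_schema": {"type": "object"}}],
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "reference doc", "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": "question"},
        ]},
    ],
}
assert count_breakpoints(example) <= MAX_BREAKPOINTS
```

Running a check like this in a unit test catches the case where a new cached block quietly pushes a request past the limit.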

Order matters

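Caching is prefix-based. If the order of your content differs from request to request, cache hits drop. Keep a stable order:

system → tools → memory → conversation prefix → current message

Within each, use deterministic ordering (e.g., tools sorted by name). Note that Anthropic assembles the cached prefix as tools, then system, then messages, whatever order the fields appear in the request, so a change to the tool list also invalidates the cached system prompt. See Context Engineering for LLMs.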
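One way to enforce deterministic ordering is to canonicalize the stable parts and fingerprint them. The sketch below is illustrative only: canonical_tools and prefix_fingerprint are hypothetical helpers, and the hash is a local diagnostic for spotting prefix drift, not what any provider computes internally.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Return tool definitions sorted by name so the serialized prefix is stable."""
    return sorted(tools, key=lambda tool: tool["name"])


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Hash the stable prefix; a changed fingerprint means the next request starts cold."""
    payload = json.dumps(
        {"system": system_prompt, "tools": canonical_tools(tools)},
        sort_keys=True,  # also fixes key order inside each tool definition
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools_a = [
    {"name": "search", "description": "Search docs"},
    {"name": "calc", "description": "Do math"},
]
tools_b = list(reversed(tools_a))  # same tools, different registration order

# Registration order no longer affects the prefix.
assert prefix_fingerprint("You are an assistant.", tools_a) == prefix_fingerprint("You are an assistant.", tools_b)
```

Logging the fingerprint next to each request makes an accidental prefix change visible as soon as it ships.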

TTL and warming

Anthropic’s ephemeral cache lasts ~5 minutes, and each hit refreshes the timer. Once it expires, the next request pays the write fee (125%) again to recreate it.

In high-traffic apps, requests arrive fast enough that the cache stays warm. In low-traffic apps, most requests pay the write fee.

Mitigations:

  • Synthetic warmer: a low-frequency request that touches the cache to keep it warm.
  • 1-hour cache (an Anthropic option; check current pricing): higher write fee but longer TTL.
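A synthetic warmer can be as small as a timer that pings only when real traffic has gone quiet. This is a sketch under assumptions: CacheWarmer and send_ping are hypothetical names, send_ping is expected to issue a tiny request that reuses the exact cached prefix, the 30-second safety margin is arbitrary, and the cache is assumed to have just been written when the warmer is created.

```python
import time
from typing import Callable

CACHE_TTL_SECONDS = 5 * 60   # default TTL quoted above
SAFETY_MARGIN_SECONDS = 30   # assumption: refresh a little before expiry


class CacheWarmer:
    """Tracks the last request that touched the cache and pings when expiry is near."""

    def __init__(self, send_ping: Callable[[], None], clock: Callable[[], float] = time.monotonic):
        self._send_ping = send_ping
        self._clock = clock
        self._last_touch = clock()

    def record_traffic(self) -> None:
        """Call after every real request; real traffic already keeps the cache warm."""
        self._last_touch = self._clock()

    def tick(self) -> bool:
        """Send a ping only if the cache is about to expire. Returns True if a ping was sent."""
        idle = self._clock() - self._last_touch
        if idle >= CACHE_TTL_SECONDS - SAFETY_MARGIN_SECONDS:
            self._send_ping()
            self._last_touch = self._clock()
            return True
        return False
```

Call tick() from a scheduler every 30 seconds or so. Each ping still costs a cache read, so this pays off only when it prevents more write fees than it adds in reads.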

Measuring hit rate

resp = client.messages.create(...)
print(resp.usage)
# Usage(input_tokens=200, cache_creation_input_tokens=0, cache_read_input_tokens=4800, ...)

The ratio cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens) is your hit rate by token volume. Aim for >80% on cacheable prefixes.

For monitoring, see LLM Observability.
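The ratio is easy to wrap in a helper. A minimal sketch: cache_hit_rate is a hypothetical function, the SimpleNamespace stands in for resp.usage, and this variant also counts cache_creation_input_tokens in the denominator so that tokens written to the cache count as misses.

```python
from types import SimpleNamespace


def cache_hit_rate(usage) -> float:
    """Share of input tokens served from cache for one response."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + created + uncached
    return read / total if total else 0.0


# Stand-in for resp.usage, using the numbers from the example above.
usage = SimpleNamespace(
    input_tokens=200,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=4800,
)
print(f"{cache_hit_rate(usage):.0%}")  # 4800 / 5000 = 96%
```

The `or 0` guards cover responses where a field is missing or None, which keeps the helper safe to call on every response.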

OpenAI specifics

OpenAI auto-caches prefixes >1024 tokens. No markers needed. Match the same prefix → cache hits. Pricing: 50% of input on hits.

For OpenAI, the engineering work is to avoid breaking the prefix:

  • Order system messages identically.
  • Don’t insert per-request data into the cached prefix.
  • For RAG, retrieved chunks change per query — they go AFTER the stable prefix.
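The three rules above amount to one message-building function that always emits the stable part first. A minimal sketch, assuming a chat-style messages list; STABLE_SYSTEM_PROMPT and build_messages are hypothetical names, and the prompt text is a placeholder.

```python
STABLE_SYSTEM_PROMPT = "You are a support assistant. <long, unchanging instructions>"


def build_messages(retrieved_chunks: list[str], question: str) -> list[dict]:
    """Stable content first, per-request content last, so the shared prefix stays identical."""
    context = "\n\n".join(retrieved_chunks)
    return [
        # Identical on every call: this is the part the provider can cache.
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        # Varies per query: retrieved chunks and the question go after the stable prefix.
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]


first = build_messages(["chunk about refunds"], "How do refunds work?")
second = build_messages(["chunk about shipping"], "When will my order ship?")
assert first[0] == second[0]  # the leading message is identical across requests
```

To confirm hits on the OpenAI side, the Chat Completions usage object reports cached tokens under prompt_tokens_details.cached_tokens; treat the exact field path as something to verify against the current API reference.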

Common mistakes

1. Per-request data inside the cache

system = f"""You are an assistant. Today is {today_date}.
Long stable instructions..."""

today_date invalidates the cache every day. Move dynamic data out of the cached prefix.
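One possible fix is to split the system prompt into a cached stable block and an uncached dynamic block. This is a sketch, not the only layout: build_system and STABLE_INSTRUCTIONS are hypothetical names, and the block format follows the Anthropic example earlier in this post.

```python
from datetime import date

STABLE_INSTRUCTIONS = "You are an assistant.\nLong stable instructions..."


def build_system(today: date) -> list[dict]:
    """Stable block first (cached), dynamic date block after the breakpoint (not cached)."""
    return [
        {
            "type": "text",
            "text": STABLE_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},  # the cached prefix ends before the date
        },
        {"type": "text", "text": f"Today is {today.isoformat()}."},
    ]


monday = build_system(date(2026, 1, 5))
tuesday = build_system(date(2026, 1, 6))
assert monday[0] == tuesday[0]  # the cached block does not change from day to day
```

If you also place breakpoints inside messages, the date still sits ahead of them in the prefix, so in that case it is safer to put it in the current user message instead.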

2. Breakpoint after the dynamic message

messages=[
    {"role": "user", "content": [
        {"type": "text", "text": question},      # dynamic
        {"type": "text", "text": REF_DOC, "cache_control": {"type": "ephemeral"}},  # too late
    ]}
]

Cache control marks “cache UP TO HERE.” If the question is before the marker, the cache is invalidated by every new question. REF_DOC needs to be FIRST.

3. Different system prompt per call

Tweaking the system prompt per request defeats caching. Pin it.

4. No cache hit metric

You assume you’re caching but it’s silently broken. Always check cache_read_input_tokens.

5. Not caching tool definitions

Tools are usually stable. Put a cache_control marker on the last tool definition so the whole tool list is cached.
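Marking tools comes down to adding cache_control to the final entry in the list. A minimal sketch, assuming Anthropic-style tool dicts; with_cached_tools and the two example tools are hypothetical.

```python
import copy


def with_cached_tools(tools: list[dict]) -> list[dict]:
    """Return a copy of the tool list with a cache breakpoint on the last definition."""
    if not tools:
        return []
    marked = copy.deepcopy(tools)  # leave the caller's definitions untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


TOOLS = [
    {
        "name": "get_time",
        "description": "Get the current time in a timezone.",
        "input_schema": {"type": "object", "properties": {"tz": {"type": "string"}}, "required": ["tz"]},
    },
    {
        "name": "get_weather",
        "description": "Look up the weather for a city.",
        "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    },
]

cached_tools = with_cached_tools(TOOLS)  # pass as tools=cached_tools
```

Sort the list by name before marking it, so the marker always lands on the same tool and the prefix stays identical between deploys.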

Realistic savings

For a chatbot with:

  • 5k-token system prompt.
  • 1k-token tool definitions.
  • ~500-token conversation prefix.
  • ~100 dynamic tokens per query.

Without caching: ~6.6k input tokens × $3/MTok = $0.020/call.

With caching (assuming 80% hit rate after warmup):

  • 80% of calls: ~6.5k cache hit + 100 dynamic = $0.0023/call.
  • 20% miss: $0.025/call.
  • Average: ~$0.007/call.

~65% cost reduction. At 1M calls/month: $20k → $7k. Real money.
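The arithmetic above is easy to rerun with your own numbers. A small sketch using the same assumptions as this section ($3/MTok input, 10% read price, 125% write price, 80% hit rate); input cost only, output tokens are not included.

```python
BASE = 3.00 / 1_000_000   # $ per input token ($3/MTok, as above)
HIT = 0.10 * BASE         # cache read: 10% of input price
WRITE = 1.25 * BASE       # cache write: 125% of input price

cached_tokens = 5_000 + 1_000 + 500   # system prompt + tools + conversation prefix
dynamic_tokens = 100
hit_rate = 0.80

no_cache = (cached_tokens + dynamic_tokens) * BASE
hit_call = cached_tokens * HIT + dynamic_tokens * BASE
miss_call = cached_tokens * WRITE + dynamic_tokens * BASE
average = hit_rate * hit_call + (1 - hit_rate) * miss_call
savings = 1 - average / no_cache

print(f"no cache: ${no_cache:.4f}  hit: ${hit_call:.4f}  miss: ${miss_call:.4f}")
print(f"average: ${average:.4f}  savings: {savings:.0%}")
print(f"monthly at 1M calls: ${no_cache * 1_000_000:,.0f} -> ${average * 1_000_000:,.0f}")
```

Changing hit_rate shows how sensitive the result is: every miss costs more than an uncached call, so a low hit rate can erase most of the savings.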

What I’d ship today

  1. Audit prompt structure: identify what’s stable, what’s dynamic.
  2. Place breakpoints at stable boundaries.
  3. Pin tool definitions and system prompt in stable order.
  4. Track cache_read_input_tokens in observability.
  5. Alert when the cache miss rate rises above the expected baseline.
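Steps 4 and 5 can start as a rolling counter before they become a dashboard. A minimal sketch: MissRateMonitor is a hypothetical class, and the window size and 20% threshold are placeholder values to tune against your own baseline.

```python
from collections import deque


class MissRateMonitor:
    """Rolling cache miss rate over the last `window` calls."""

    def __init__(self, window: int = 100, threshold: float = 0.20):
        self._hits = deque(maxlen=window)  # True for a hit, False for a miss
        self._threshold = threshold

    def record(self, cache_read_input_tokens: int) -> None:
        """A call counts as a hit if any tokens were read from cache."""
        self._hits.append(cache_read_input_tokens > 0)

    @property
    def miss_rate(self) -> float:
        if not self._hits:
            return 0.0
        return 1 - sum(self._hits) / len(self._hits)

    def should_alert(self) -> bool:
        """True once the window is full and the miss rate exceeds the threshold."""
        return len(self._hits) == self._hits.maxlen and self.miss_rate > self._threshold
```

Feed it resp.usage.cache_read_input_tokens after each call and wire should_alert() into whatever paging or logging you already run.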

A weekend’s work. Permanent savings.

Read this next

If you want my prompt-cache audit script + observability dashboard, it’s at rajpoot.dev.

