LLM bills add up fast. A side project at $100/month becomes a feature at $50k/month. The good news: most apps over-spend by 5–20× on inference. This post is the working playbook for cutting that.

The cost levers (ordered by impact)

  1. Prompt caching — for repeated context.
  2. Model routing — easy → cheap, hard → premium.
  3. Batch API — 50% off if latency tolerant.
  4. Structured output — fewer tokens.
  5. Fine-tuning — for high-volume narrow tasks.
  6. Caching responses — avoid redundant calls.
  7. Smaller context windows — only relevant chunks.

Prompt caching

The biggest win in 2026.

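client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API; size it to the use case
    system=[
        {
            "type": "text",
            "text": HUGE_SYSTEM_PROMPT,
            # Marks the end of the prefix to cache
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)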

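The system prompt is cached for 5 minutes, and each cache hit refreshes that window. Cache reads cost ~10% of the normal input price; the first call, which writes the cache, costs slightly more than normal. For an agent loop calling 20 times per session, that cuts input cost on the cached prefix by well over 80%, approaching 90% as the call count grows.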

For RAG with stable retrieved context: cache the retrieved chunks too.
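The arithmetic behind the 20-call figure, as a minimal sketch. The 1.25× write and 0.10× read multipliers are assumptions based on Anthropic's published pricing for the 5-minute cache at the time of writing, and the 10k-token prompt and $3/MTok price are illustrative; check current pricing before relying on the numbers.

```python
def uncached_input_cost(prefix_tokens: int, calls: int, price_per_mtok: float) -> float:
    """Dollars spent re-sending the same prefix in full on every call."""
    return prefix_tokens / 1_000_000 * price_per_mtok * calls


def cached_input_cost(
    prefix_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.10,   # assumed cache-read multiplier
) -> float:
    """Dollars spent when the first call writes the cache and the rest read it.

    Assumes every later call arrives before the cache entry expires.
    """
    base = prefix_tokens / 1_000_000 * price_per_mtok
    return base * (write_mult + read_mult * (calls - 1))


# Hypothetical session: 10k-token system prompt, 20 calls, $3/MTok input.
without = uncached_input_cost(10_000, 20, 3.0)   # about $0.60
with_cache = cached_input_cost(10_000, 20, 3.0)  # about $0.09
saving = 1 - with_cache / without                # about 0.84
```

The saving climbs toward 90% as calls per session increase, because the one-time write premium is spread over more reads.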

Model routing

async def route(query: str) -> str:
    classification = await haiku.classify(query)  # cheap
    if classification.complexity == "trivial":
        return await haiku.answer(query)
    if classification.complexity == "moderate":
        return await sonnet.answer(query)
    return await opus.answer(query)  # only the hard ones

Most queries are easy. Don’t waste premium-model dollars on them. See LLM Routing.
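A runnable variant of the router above, as a sketch. The classifier and the two model functions are stand-ins (hypothetical names, not real API clients), and the word-count heuristic only exists to make the example executable. The one design choice worth copying: when the classifier fails or returns an unknown label, fall back to the strongest tier rather than the cheapest.

```python
import asyncio
from typing import Awaitable, Callable

Answer = Callable[[str], Awaitable[str]]


def make_router(
    classify: Callable[[str], Awaitable[str]],
    tiers: dict[str, Answer],
    fallback: str,
) -> Answer:
    """Build a route() that sends each query to the tier its label maps to."""

    async def route(query: str) -> str:
        try:
            label = await classify(query)
        except Exception:
            label = fallback  # classifier failed: use the fallback tier
        handler = tiers.get(label, tiers[fallback])  # unknown label: same
        return await handler(query)

    return route


# Stand-ins for real model calls (hypothetical; swap in your own clients).
async def fake_classify(query: str) -> str:
    return "trivial" if len(query.split()) < 6 else "hard"


async def cheap(query: str) -> str:
    return f"cheap:{query}"


async def premium(query: str) -> str:
    return f"premium:{query}"


route = make_router(
    fake_classify, {"trivial": cheap, "hard": premium}, fallback="hard"
)
```

Calling asyncio.run(route("What is X?")) goes to the cheap tier here; a longer query goes to the premium one.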

Batch API

batch = await client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {...}}
        for i in range(1000)
    ]
)
# Poll until done; results back at 50% price

For nightly jobs, eval runs, content generation: 50% off. Up to 24h delivery. For latency-tolerant work this is free money.
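The "poll until done" step can be sketched as a generic helper with backoff. Everything here is an assumption for illustration: fetch and is_done are callables you supply (with the Anthropic SDK, fetch would wrap the batch retrieval call and is_done would check its status field; verify the exact names in the SDK docs). It is synchronous for simplicity; in async code, swap the sleep for asyncio.sleep.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 30.0,
    max_interval: float = 600.0,
    timeout: float = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until is_done(result), doubling the wait between attempts."""
    waited = 0.0
    while True:
        result = fetch()
        if is_done(result):
            return result
        if waited >= timeout:
            raise TimeoutError(f"batch not finished after {waited:.0f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 2, max_interval)  # back off, capped
```

Backoff matters little for cost, but it keeps a 24-hour job from generating thousands of status requests.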

Structured output

# Verbose: free-form text response
"Please return a JSON object with fields..."  # +400 tokens of instructions

# Concise: tool calling
tools=[{"name": "answer", "input_schema": ResponseSchema.model_json_schema()}]

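The schema carries the format, so the prompt needs no verbose formatting instructions, and output tokens drop too. Set tool_choice to force the tool call, and still validate the result: plain tool calling constrains the output strongly but does not guarantee schema-valid JSON. See Structured Output.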
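A slightly fuller sketch of the tool-calling side. The schema fields (category, confidence) are hypothetical, written by hand in the shape ResponseSchema.model_json_schema() would produce. The response is modeled as a list of plain dicts mirroring the raw JSON content blocks; the SDK returns objects with the same fields as attributes, so adapt the accessors.

```python
from typing import Any

# Hypothetical schema for illustration only.
ANSWER_TOOL: dict[str, Any] = {
    "name": "answer",
    "description": "Return the final answer in structured form.",
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "other"]},
            "confidence": {"type": "number"},
        },
        "required": ["category", "confidence"],
    },
}

# Extra keyword arguments to pass alongside model/messages.
REQUEST_KWARGS: dict[str, Any] = {
    "tools": [ANSWER_TOOL],
    # Force the model to call this tool instead of replying in prose.
    "tool_choice": {"type": "tool", "name": "answer"},
}


def extract_tool_input(content: list[dict[str, Any]], tool_name: str) -> dict[str, Any]:
    """Return the input of the first tool_use block for tool_name."""
    for block in content:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise ValueError(f"no tool_use block for {tool_name!r}")
```

The returned dict is what you would then validate against your own model before trusting it.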

Fine-tunes for high volume

For a narrow task running at high volume:

  • Replace Sonnet ($3/MTok input) with fine-tuned Llama 8B ($0.20/MTok equivalent self-hosted).
  • 15× cheaper, comparable quality on the narrow task.
  • Training cost: $5–500 (one-time).

Real example: a classifier processing 10M queries/month at $3/MTok input (about 1,000 input tokens per query, which is what the totals imply) went from $30k/month to $2k/month after a fine-tune. Training cost: $50.

See Fine-Tuning LoRA / QLoRA.
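The classifier example works out as follows. The 1,000 tokens per query is an assumption: the post does not state it, but it is the value that makes $3/MTok come to $30k/month at 10M queries. Output tokens and GPU ops overhead are ignored in this sketch.

```python
def monthly_input_cost(queries: int, tokens_per_query: int, price_per_mtok: float) -> float:
    """Dollars per month spent on input tokens."""
    return queries * tokens_per_query / 1_000_000 * price_per_mtok


QUERIES = 10_000_000
TOKENS_PER_QUERY = 1_000  # assumption implied by the $30k figure

sonnet = monthly_input_cost(QUERIES, TOKENS_PER_QUERY, 3.00)      # about $30,000
fine_tuned = monthly_input_cost(QUERIES, TOKENS_PER_QUERY, 0.20)  # about $2,000
ratio = sonnet / fine_tuned                                       # about 15x
```

At that gap, a one-time training cost in the $5–500 range is recovered within the first day of traffic.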

Response caching

For deterministic outputs:

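import hashlib

# Assumes `redis` is an async Redis client and `llm` is your own wrapper.
async def cached_complete(prompt: str) -> str:
    # Keyed on the exact prompt; add the model name and parameters if they vary.
    key = f"llm:{hashlib.sha256(prompt.encode()).hexdigest()}"
    hit = await redis.get(key)
    if hit is not None:  # an empty cached string still counts as a hit
        return hit.decode()

    result = await llm.complete(prompt)
    await redis.set(key, result, ex=86400)  # expire after 24 hours
    return result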

Answer common questions (“What is X?”) once; subsequent identical prompts cost nothing.

For semantic caching (similar but not identical prompts): use embedding similarity.
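A minimal sketch of the semantic variant, assuming you already have an embedding vector for each prompt from whatever embedding model you use. The class name, the 0.95 threshold, and the linear scan are all illustrative choices; a real deployment would use a vector index and tune the threshold on its own traffic, since a threshold that is too loose returns wrong answers.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache keyed by embedding; fine for small caches only."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the answer stored for the most similar prompt, if close enough."""
        best_score, best_answer = 0.0, None
        for stored, answer in self._entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self._entries.append((list(embedding), answer))
```

Usage follows the exact-match cache: embed the prompt, call get(), and only on a miss call the LLM and put() the result.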

Trim context

# BAD: include all conversation history
messages = entire_history  # 50 messages

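# GOOD: keep last N + summary of older
# (APIs without a "system" message role, such as Anthropic's Messages API,
# take the summary through the top-level system parameter instead.)
messages = [
    {"role": "system", "content": f"Earlier: {summary_of_old}"},
    *recent_5_messages,
]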

LLM context cost is linear in tokens. Old context rarely matters; summarize and drop.
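A variant of "keep last N" that cuts by token budget instead of message count, as a sketch. The 4-characters-per-token estimate is a rough assumption, not a tokenizer; pass your provider's real token counter as count if you have one. The function returns the older messages separately so they can be summarized.

```python
from typing import Callable

Message = dict[str, str]


def rough_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def split_by_budget(
    history: list[Message],
    budget: int,
    count: Callable[[str], int] = rough_tokens,
) -> tuple[list[Message], list[Message]]:
    """Split history into (older, recent), where recent fits within budget tokens.

    Walks backwards from the newest message and stops at the first one
    that would exceed the budget.
    """
    used = 0
    start = len(history)
    for i in range(len(history) - 1, -1, -1):
        cost = count(history[i]["content"])
        if used + cost > budget:
            break
        used += cost
        start = i
    return history[:start], history[start:]
```

Summarize the older list once, cache that summary, and send only the summary plus the recent list.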

Off-load to embeddings

Many “AI” tasks are actually similarity matching:

# BAD: ask LLM "is this question similar to that one?"
# COST: ~$0.01 per pair

# GOOD: precompute embeddings, cosine similarity
# COST: <$0.0001 per pair

For deduplication, classification, search, FAQ matching: embeddings are 100× cheaper than LLM calls. See Embeddings & Semantic Search.

Self-host for sustained volume

Volume                        Best
<100M tokens/month            API
100M–1B tokens/month          API + caching + routing
>1B tokens/month, sustained   Self-host (vLLM) considered

Self-hosting at scale: 5–10× cheaper but adds GPU ops. For sustained heavy loads, it pays. See Self-Hosted LLMs.

Streaming doesn’t save cost

Common confusion: streaming doesn’t reduce token spend; it just reduces perceived latency. You pay for the same tokens.

But streaming + early-stop can save: stop the stream when you see what you need, skipping remaining tokens.
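Early stop can be sketched as a loop that stops reading once a condition on the accumulated text holds. The chunk source here is a fake generator standing in for a real streaming response. Whether this saves money depends on the provider halting generation when you close the stream, so treat that as an assumption to verify.

```python
from typing import Callable, Iterable, Iterator


def take_until(chunks: Iterable[str], done: Callable[[str], bool]) -> str:
    """Accumulate streamed text and stop reading as soon as done(text) is true."""
    text = ""
    for chunk in chunks:
        text += chunk
        if done(text):
            break  # with a real stream, also close the connection here
    return text


consumed: list[str] = []


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response; records which chunks were pulled."""
    for chunk in ["LABEL: ", "spam\n", "Because ", "the message mentions..."]:
        consumed.append(chunk)
        yield chunk


# Stop after the first line: the label is all we need, not the explanation.
label_line = take_until(fake_stream(), lambda text: "\n" in text)
```

Only the first two chunks are pulled from the generator; the explanation is never read.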

Cost observability

Tag every LLM call with metadata:

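resp = await client.messages.create(
    model="claude-sonnet-4-6",
    # Anthropic's metadata field accepts only user_id; log other tags
    # (such as feature="summarize") next to resp.usage in your own telemetry.
    metadata={"user_id": str(user_id)},
    ...
)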

Roll up by feature / user. Find the 80/20 — usually one feature dominates spend. Optimize there first.
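The roll-up itself is a few lines once each call is logged with its tags and cost. A sketch, assuming each record is a dict with a cost_usd field plus your tag fields; the record shape and field names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Iterable


def spend_by(records: Iterable[dict[str, Any]], key: str) -> list[tuple[str, float, float]]:
    """Return (tag value, dollars, share of total), largest spender first."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[str(record.get(key, "untagged"))] += record["cost_usd"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (tag, dollars, dollars / grand_total if grand_total else 0.0)
        for tag, dollars in ranked
    ]


# Illustrative records, not real data.
calls = [
    {"feature": "summarize", "user_id": "u1", "cost_usd": 8.0},
    {"feature": "chat", "user_id": "u2", "cost_usd": 1.5},
    {"feature": "chat", "user_id": "u1", "cost_usd": 0.5},
]
by_feature = spend_by(calls, "feature")
```

Run the same function with key="user_id" to spot individual heavy users.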

Audit your spend

Walk through a week of usage:

  1. Top 10 prompts by spend. Are they cacheable?
  2. Top 10 features by spend. Can any be downgraded to a smaller model?
  3. % of calls hitting cache. If <50%, prompt caching is missing or misconfigured.
  4. Average input tokens. If huge, can you trim?
  5. Output tokens. Can you cap with max_tokens?

Most teams find $5k–$50k/month savings from this audit alone.
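Item 3 of the audit can be computed from the usage block returned with each response. A sketch, assuming you log usage as dicts with Anthropic's field names (input_tokens, cache_creation_input_tokens, cache_read_input_tokens); other providers name these differently.

```python
from typing import Iterable, Mapping, Optional


def prompt_cache_stats(usages: Iterable[Mapping[str, Optional[int]]]) -> dict[str, float]:
    """Share of calls that read from the prompt cache, and share of input tokens served from it."""
    calls = hits = read_tokens = total_input = 0
    for usage in usages:
        cached = usage.get("cache_read_input_tokens") or 0
        written = usage.get("cache_creation_input_tokens") or 0
        fresh = usage.get("input_tokens") or 0
        calls += 1
        hits += 1 if cached > 0 else 0
        read_tokens += cached
        total_input += cached + written + fresh
    return {
        "call_hit_rate": hits / calls if calls else 0.0,
        "token_read_share": read_tokens / total_input if total_input else 0.0,
    }
```

The token share is the more useful number: a high call hit rate with a low token share usually means the cache breakpoint sits too early in the prompt.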

Common mistakes

1. One model for everything

Premium model for trivial classification. Route by difficulty.

2. No prompt caching

Stable system prompts re-sent in full every request. Savings of 80–90% on those input tokens left on the table.

3. Sync API calls when batch would work

Nightly eval pipelines using sync API at full price. Batch API is 50% off.

4. LLM for tasks embeddings can do

“Is this similar?” / “What category?”: embedding similarity or a small classifier often suffices.

5. No max_tokens

Model rambles to 4000 tokens when 200 sufficed. Set max_tokens per use case.

What I’d ship today

For an existing AI app:

  1. Add prompt caching to system prompts and stable RAG context. Day-one win.
  2. Add a router with Haiku for trivial / Sonnet for hard.
  3. Use batch API for non-realtime work.
  4. Cache responses via Redis where deterministic.
  5. Audit monthly; tune the 80/20.

Read this next

If you want my LLM cost-audit checklist + caching layer, it’s at rajpoot.dev.


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.