LLM bills add up fast. A side project at $100/month becomes a feature at $50k/month. The good news: most apps over-spend by 5–20× on inference. This post is the working playbook for cutting that.
The cost levers (ordered by impact)
- Prompt caching — for repeated context.
- Model routing — easy → cheap, hard → premium.
- Batch API — 50% off if latency tolerant.
- Structured output — fewer tokens.
- Fine-tuning — for high-volume narrow tasks.
- Caching responses — avoid redundant calls.
- Smaller context windows — only relevant chunks.
Prompt caching
The biggest win in 2026.
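client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API; size it to the use case
    system=[
        {
            "type": "text",
            "text": HUGE_SYSTEM_PROMPT,
            # Marks the end of the prefix that should be cached
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)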
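The system prompt is cached for 5 minutes, and each cache hit refreshes that window. Cache reads are billed at roughly 10% of the normal input price, while the first call pays a premium of about 25% to write the cache. For an agent loop making 20 calls per session, that is roughly an 84% reduction in input cost on the cached prefix, approaching 90% as the number of calls grows.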
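To sanity-check the savings, here is a small worked calculation. It assumes the first call pays a cache-write multiplier of 1.25× the base input price and every later call pays 0.1× (the multipliers published for the 5-minute cache at the time of writing; check current pricing), and that every call lands inside the cache window. The function name is illustrative.

```python
def cached_input_cost_ratio(
    calls: int, write_mult: float = 1.25, read_mult: float = 0.10
) -> float:
    """Cost of the cached prefix with caching, relative to resending it uncached."""
    if calls < 1:
        raise ValueError("calls must be >= 1")
    # The first call writes the cache; the remaining calls read from it.
    with_cache = write_mult + (calls - 1) * read_mult
    without_cache = float(calls)
    return with_cache / without_cache


if __name__ == "__main__":
    for n in (1, 2, 20, 100):
        ratio = cached_input_cost_ratio(n)
        print(f"{n:>3} calls: {ratio:.1%} of uncached cost")
```

Under these assumptions a single call costs more with caching than without, a 20-call session lands near 84% savings on the cached prefix, and the figure approaches 90% as the call count grows.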
For RAG with stable retrieved context: cache the retrieved chunks too.
Model routing
async def route(query: str) -> str:
    # haiku / sonnet / opus are thin wrappers around the respective models
    classification = await haiku.classify(query)  # cheap
    if classification.complexity == "trivial":
        return await haiku.answer(query)
    if classification.complexity == "moderate":
        return await sonnet.answer(query)
    return await opus.answer(query)  # only the hard ones
Most queries are easy. Don’t waste premium-model dollars on them. See LLM Routing.
Batch API
batch = await client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {...}}
        for i in range(1000)
    ]
)
# Poll until done; results back at 50% price
For nightly jobs, eval runs, content generation: 50% off. Up to 24h delivery. For latency-tolerant work this is free money.
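As a sketch of what the batch entries might look like once filled in: the helper below builds one entry per prompt, assuming `params` takes the same fields as a normal Messages call (`model`, `max_tokens`, `messages`). The helper name and the defaults are illustrative, not part of any SDK.

```python
from typing import Any


def build_batch_requests(
    prompts: list[str], model: str, max_tokens: int = 512
) -> list[dict[str, Any]]:
    """Turn a list of prompts into batch entries with stable custom_ids."""
    return [
        {
            "custom_id": f"req-{i}",  # used to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
```

Batch results are not guaranteed to come back in submission order, so join them to the inputs on `custom_id` and not on position.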
Structured output
# Verbose: free-form text response
"Please return a JSON object with fields..." # +400 tokens of instructions
# Concise: tool calling
tools=[{"name": "answer", "input_schema": ResponseSchema.model_json_schema()}]
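The schema carries the format, so the prompt needs no verbose formatting instructions. Output tokens drop too, because the model skips the prose wrapper around the data. Force the tool with tool_choice so the reply is always the structured call. See Structured Output.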
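A minimal sketch of the tool-calling setup, assuming the Anthropic-style `tools` and `tool_choice` request fields. The hand-written schema stands in for `ResponseSchema.model_json_schema()`, and the helper name and field names are hypothetical.

```python
from typing import Any

# Hand-written JSON Schema standing in for ResponseSchema.model_json_schema().
ANSWER_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
        "confidence": {"type": "number"},
    },
    "required": ["category", "confidence"],
}


def structured_request_kwargs(
    schema: dict[str, Any], tool_name: str = "answer"
) -> dict[str, Any]:
    """Build the tools/tool_choice arguments that force one structured reply."""
    return {
        "tools": [
            {
                "name": tool_name,
                "description": "Return the final answer as structured data.",
                "input_schema": schema,
            }
        ],
        # Forcing the tool makes the reply a tool call, not free text.
        "tool_choice": {"type": "tool", "name": tool_name},
    }
```

The result can be splatted into the request, for example `client.messages.create(..., **structured_request_kwargs(ANSWER_SCHEMA))`. It is still worth validating the returned arguments before trusting them.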
Fine-tunes for high volume
For a narrow task running at high volume:
- Replace Sonnet ($3/MTok input) with fine-tuned Llama 8B ($0.20/MTok equivalent self-hosted).
- 15× cheaper, comparable quality on the narrow task.
- Training cost: $5–500 (one-time).
Real example: a classifier processing 10M queries/month (about 1,000 input tokens per query, so roughly 10B tokens) at $3/MTok input went from $30k/month to $2k/month after a fine-tune. Training cost: $50.
See Fine-Tuning LoRA / QLoRA.
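The arithmetic behind that example can be checked in a few lines. The 1,000 tokens per query is an assumption inferred from the quoted figures ($30k at $3/MTok implies about 10B input tokens per month), and the function names are illustrative.

```python
def monthly_input_cost(
    queries: int, tokens_per_query: int, usd_per_mtok: float
) -> float:
    """Monthly input spend in USD for a given volume and per-million-token price."""
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_mtok


def payback_days(
    training_cost: float, old_monthly: float, new_monthly: float
) -> float:
    """Days of savings needed to recover a one-time training cost (30-day month)."""
    daily_saving = (old_monthly - new_monthly) / 30
    if daily_saving <= 0:
        raise ValueError("new_monthly must be lower than old_monthly")
    return training_cost / daily_saving


if __name__ == "__main__":
    before = monthly_input_cost(10_000_000, 1_000, 3.00)  # hosted premium model
    after = monthly_input_cost(10_000_000, 1_000, 0.20)   # self-hosted fine-tune
    print(f"before=${before:,.0f} after=${after:,.0f}")
    print(f"payback: {payback_days(50, before, after):.2f} days")
```

At this volume the training cost is recovered in well under a day. GPU operations and engineering time are not included in the per-token figure, so the real payback is longer.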
Response caching
For deterministic outputs:
import hashlib

# `redis` is an async Redis client and `llm` a thin model wrapper, both created elsewhere
async def cached_complete(prompt: str) -> str:
    key = f"llm:{hashlib.sha256(prompt.encode()).hexdigest()}"
    if hit := await redis.get(key):
        return hit.decode()  # Redis returns bytes by default
    result = await llm.complete(prompt)
    await redis.set(key, result, ex=86400)  # expire after 24 hours
    return result
Answer common questions (“What is X?”) once. Subsequent identical prompts cost only the cache lookup.
For semantic caching (similar but not identical prompts): use embedding similarity.
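One design point worth sketching: a key built from the prompt alone will return stale or wrong answers once the model or sampling parameters change. The helper below is a hypothetical variant that folds those into the hash, using only the standard library.

```python
import hashlib
import json
from typing import Any


def cache_key(model: str, prompt: str, **params: Any) -> str:
    """Deterministic key covering everything that can change the completion."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # same params in a different order give the same key
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return f"llm:{hashlib.sha256(payload.encode('utf-8')).hexdigest()}"
```

Calling `cache_key("claude-sonnet-4-6", prompt, temperature=0, max_tokens=200)` yields the same key regardless of keyword order, and a different key as soon as the model or any parameter changes.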
Trim context
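# BAD: include all conversation history
messages = entire_history  # 50 messages
# GOOD: keep last N + summary of older
# (APIs without a "system" role in messages, such as Anthropic's,
#  take the summary via the top-level `system` parameter instead)
messages = [
    {"role": "system", "content": f"Earlier: {summary_of_old}"},
    *recent_5_messages,
]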
LLM context cost is linear in tokens. Old context rarely matters; summarize and drop.
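A minimal sketch of the keep-last-N-plus-summary pattern as a reusable function. The `summarize` callable is an assumption: in practice it would be a cheap model call, here it is injected so the logic can be tested without one.

```python
from typing import Callable

Message = dict[str, str]


def trim_history(
    history: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_last: int = 5,
) -> list[Message]:
    """Keep the last `keep_last` messages and fold everything older into a summary."""
    if keep_last < 1:
        raise ValueError("keep_last must be >= 1")
    if len(history) <= keep_last:
        return list(history)  # nothing old enough to summarize
    older, recent = history[:-keep_last], history[-keep_last:]
    summary = summarize(older)
    return [{"role": "system", "content": f"Earlier: {summary}"}, *recent]
```

Caching the summary and only regenerating it every few turns keeps the summarization call itself from becoming a new cost center.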
Off-load to embeddings
Many “AI” tasks are actually similarity matching:
# BAD: ask LLM "is this question similar to that one?"
# COST: ~$0.01 per pair
# GOOD: precompute embeddings, cosine similarity
# COST: <$0.0001 per pair
For deduplication, classification, search, FAQ matching: embeddings are roughly 100× cheaper than LLM calls, going by the per-pair costs above. See Embeddings & Semantic Search.
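The similarity check itself is a few lines of arithmetic once the vectors exist. This sketch uses plain Python lists in place of real embeddings; the `best_match` helper and its 0.8 threshold are illustrative and would need tuning per embedding model.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_match(
    query_vec: Sequence[float],
    candidates: dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the label of the closest candidate, or None if nothing clears the bar."""
    best_label, best_score = None, threshold
    for label, vec in candidates.items():
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label
```

A `None` result is the natural place to fall back to an LLM call, so the expensive path only runs for queries that match no precomputed entry.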
Self-host for sustained volume
| Volume | Best option |
|---|---|
| <100M tokens/month | API |
| 100M–1B tokens/month | API + caching + routing |
| >1B tokens/month, sustained | Consider self-hosting (vLLM) |
Self-hosting at scale: 5–10× cheaper but adds GPU ops. For sustained heavy loads, it pays. See Self-Hosted LLMs.
Streaming doesn’t save cost
Common confusion: streaming doesn’t reduce token spend; it just reduces perceived latency. You pay for the same tokens.
Streaming plus early stopping can save money, though: close the stream once you have what you need, and you skip the output tokens that would have followed.
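A sketch of the early-stop idea over any iterator of text chunks. The stop marker is hypothetical and would depend on the output format. Whether billing stops immediately depends on the provider ending generation when the connection closes, which is an assumption here.

```python
from typing import Iterable


def take_until(chunks: Iterable[str], stop_marker: str) -> str:
    """Accumulate streamed text and stop consuming once `stop_marker` appears."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        idx = buffer.find(stop_marker)  # searching the buffer catches split markers
        if idx != -1:
            # Returning here abandons the iterator; the caller should also close
            # the underlying stream so the provider can stop generating.
            return buffer[:idx]
    return buffer
```

With a real SDK, the call would sit inside the stream's context manager so that leaving the block closes the connection.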
Cost observability
Tag every LLM call with metadata:
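resp = await client.messages.create(
    model="claude-sonnet-4-6",
    # The Messages API documents only `user_id` inside `metadata`,
    # so the feature tag goes to your own usage log instead.
    metadata={"user_id": str(user_id)},
    ...
)
usage_log.record(feature="summarize", user_id=user_id, usage=resp.usage)  # hypothetical logger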
Roll up by feature / user. Find the 80/20 — usually one feature dominates spend. Optimize there first.
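The roll-up can be as simple as the sketch below. `CallRecord` is a hypothetical shape for whatever your usage log stores, and the per-million-token prices are passed in so nothing is hard-coded.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CallRecord:
    feature: str
    input_tokens: int
    output_tokens: int


def spend_by_feature(
    records: Iterable[CallRecord],
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> list[tuple[str, float]]:
    """Total USD per feature, sorted with the most expensive feature first."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        cost = (
            r.input_tokens * input_usd_per_mtok
            + r.output_tokens * output_usd_per_mtok
        ) / 1_000_000
        totals[r.feature] += cost
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The first entry of the result is the feature to optimize first. Grouping by `user_id` works the same way with a different key.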
Audit your spend
Walk through a week of usage:
- Top 10 prompts by spend. Are they cacheable?
- Top 10 features by spend. Can any be downgraded to a smaller model?
- % of calls hitting cache. If <50%, prompt caching is missing.
- Average input tokens. If huge, can you trim?
- Output tokens. Can you cap with max_tokens?
Most teams find $5k–$50k/month savings from this audit alone.
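The cache-hit check from the list above can be computed from logged usage objects. This sketch assumes each record is a dict carrying Anthropic-style usage fields, where `cache_read_input_tokens` is non-zero on a cache hit. Adjust the field name for other providers.

```python
from typing import Iterable, Mapping


def cache_hit_rate(usages: Iterable[Mapping[str, int]]) -> float:
    """Fraction of calls that read at least one token from the prompt cache."""
    total = 0
    hits = 0
    for usage in usages:
        total += 1
        if usage.get("cache_read_input_tokens", 0) > 0:
            hits += 1
    return hits / total if total else 0.0
```

A rate well below 50% on a feature with a stable system prompt usually points to a missing `cache_control` marker or to a prefix that changes between calls.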
Common mistakes
1. One model for everything
Premium model for trivial classification. Route by difficulty.
2. No prompt caching
Stable system prompts re-sent in full on every request, at full input price, when cached reads would cost about a tenth of that.
3. Sync API calls when batch would work
Nightly eval pipelines using sync API at full price. Batch API is 50% off.
4. LLM for tasks embeddings can do
“Is this similar?” / “What category?” — often embedding similarity / classifier suffice.
5. No max_tokens
Model rambles to 4000 tokens when 200 sufficed. Set max_tokens per use case.
What I’d ship today
For an existing AI app:
- Add prompt caching to system prompts and stable RAG context. Day-one win.
- Add a router with Haiku for trivial / Sonnet for hard.
- Use batch API for non-realtime work.
- Cache responses via Redis where deterministic.
- Audit monthly; tune the 80/20.
Read this next
If you want my LLM cost-audit checklist + caching layer, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.