LLM bills compound fast. A team that started at $500/month can be at $50k/month by month 9. The good news: most of that growth is waste. With the right tactics you can cut the bill 50–90% without giving up anything users notice. This post is the working playbook.
The cost shape
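Three multipliers stack: how many calls you make, how many tokens each call uses, and the price per token. For a single call: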
total_cost ≈ (input_tokens × in_price) + (output_tokens × out_price)
Each lever attacks a different part:
| Lever | Attacks | Typical savings |
|---|---|---|
| Prompt caching | input_tokens × in_price | 60–90% on repeated prefixes |
| Model routing | in_price + out_price | 30–80% via cheaper models |
| Output bounds | output_tokens | 20–50% |
| Semantic caching | repeated calls | 30–70% on FAQ-shape traffic |
| Batching | both prices × 0.5 | 50% for batch-able work |
| Fine-tuning | both prices | 80–95% on narrow tasks |
You want to apply as many as fit. The savings compound.
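To make "the savings compound" concrete, here is a rough sketch of how stacked reductions combine. It assumes each lever removes a fraction of whatever bill the previous levers left behind, which is a simplification, since the levers in the table hit different terms of the formula. The function name and the example percentages are illustrative, not measurements.

```python
def combined_savings(fractions):
    """Overall fraction saved when each lever removes its fraction
    of the bill left over by the levers before it."""
    remaining = 1.0
    for fraction in fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"savings fraction out of range: {fraction}")
        remaining *= 1.0 - fraction
    return 1.0 - remaining


# Hypothetical example: one lever saves 50%, a second saves 40% of the rest.
total = combined_savings([0.5, 0.4])
print(f"{total:.0%} saved overall")
```

Note that 50% and 40% combine to 70%, not 90%: each extra lever works on a smaller bill.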
1. Prompt caching (do this first)
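Anthropic and OpenAI both support prompt caching. With Anthropic you mark a stable prefix; subsequent requests that reuse that prefix get it billed at ~10% of the input price (the first request, which writes the cache, costs a bit more than normal input). OpenAI caches long repeated prefixes automatically, with its own discount schedule.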
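import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=400,  # required by the Messages API
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,  # 5k tokens
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=[{"role": "user", "content": question}],
)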
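For a chatbot with a 5k-token system prompt plus 1k tokens of conversation history, at 1,000 conversations/day, caching the system prompt is roughly the difference between $7.50 and $0.75 per day on that prefix (the exact figures depend on your model’s input price). The conversation history sits after the marker and still pays full price.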
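A worked version of that arithmetic, as a sketch. The input price of $3 per million tokens is an assumption for illustration, as is the function name; cache-write surcharges and cache misses are ignored. The dollar amounts will not match the figures above, since they depend on the price you plug in and this version also counts the uncached conversation tokens.

```python
def daily_input_cost(prefix_tokens, dynamic_tokens, calls_per_day,
                     price_per_mtok, cache_read_multiplier=None):
    """Daily input spend in USD.

    When cache_read_multiplier is given (e.g. 0.10), the prefix is billed
    at that fraction of the input price. Dynamic tokens always pay full price.
    """
    multiplier = 1.0 if cache_read_multiplier is None else cache_read_multiplier
    prefix_cost = prefix_tokens * price_per_mtok * multiplier
    dynamic_cost = dynamic_tokens * price_per_mtok
    return (prefix_cost + dynamic_cost) * calls_per_day / 1_000_000


PRICE = 3.00  # assumed USD per million input tokens, not a quoted price

uncached = daily_input_cost(5_000, 1_000, 1_000, PRICE)
cached = daily_input_cost(5_000, 1_000, 1_000, PRICE, cache_read_multiplier=0.10)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
```

Under these assumptions the prefix drops by 10× while the dynamic tokens are unchanged, so the total falls by 4×, not 10×. The longer the stable prefix relative to the dynamic part, the closer the total gets to the full discount.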
Anthropic allows up to 4 cache markers per request. Place them at: end of system prompt, end of tool definitions, end of static reference docs, end of fixed conversation prefix. Everything after the last marker is dynamic and pays full price.
I covered the mechanics in Anthropic Claude API + Tool Use Guide.
2. Model routing
Most apps use one model for everything. Most apps would do fine using a small model for 80% of traffic and a big model only for the hard 20%.
def pick_model(task_type: str, input_length: int) -> str:
    if task_type == "classification":
        return "claude-haiku-4-5"
    if task_type == "extraction":
        return "claude-haiku-4-5"
    if task_type == "summarization" and input_length < 5000:
        return "claude-haiku-4-5"
    if task_type in {"reasoning", "code_review"}:
        return "claude-opus-4-7"
    return "claude-sonnet-4-6"
Smarter version: a fast classifier (Haiku) routes to the right model for each request. The router cost is negligible; the savings are huge.
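A minimal sketch of that classifier-based router. The classifier is passed in as a plain callable so the example runs without an API key; in production it would be a small-model call. The tier labels, the fallback policy, and `toy_classifier` are assumptions made for this sketch.

```python
from typing import Callable

# Model names come from pick_model above; the tier labels are hypothetical.
MODEL_BY_TIER = {
    "easy": "claude-haiku-4-5",
    "medium": "claude-sonnet-4-6",
    "hard": "claude-opus-4-7",
}
DEFAULT_MODEL = "claude-sonnet-4-6"


def route(prompt: str, classify: Callable[[str], str]) -> str:
    """Pick a model using a cheap classifier.

    `classify` stands in for a small-model call that returns a tier label.
    Any failure or unexpected label falls back to the middle tier.
    """
    try:
        tier = classify(prompt).strip().lower()
    except Exception:
        return DEFAULT_MODEL
    return MODEL_BY_TIER.get(tier, DEFAULT_MODEL)


def toy_classifier(prompt: str) -> str:
    """Stand-in for demonstration only; real routing would ask a small model."""
    return "hard" if "prove" in prompt.lower() else "easy"


print(route("Prove this lemma", toy_classifier))
print(route("What time is it?", toy_classifier))
```

Falling back to the middle tier on errors keeps a flaky router from either overspending or degrading quality on every request.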
For more on routing infrastructure see AI Gateways.
3. Output bounds
client.messages.create(
    ...,
    max_tokens=400,  # not 4096
)
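It’s tempting to set max_tokens generously, and many examples and wrappers default it to several thousand. Most production responses are 100–500 tokens. Setting max_tokens to the realistic worst case stops a model that decides to write an essay. The cap truncates rather than shortens, so check the stop reason and raise the limit for tasks that regularly hit it.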
For structured output, this is even more important — a tool call rarely needs more than 200 tokens. Set the cap.
4. Semantic caching
For FAQ-shaped traffic (“what’s our refund policy”, “how do I reset my password”), embed the query and look up similar past queries:
async def answer(query: str) -> str:
    embedding = await embed(query)
    similar = await pgvector_search(embedding, threshold=0.95)
    if similar and similar.score > 0.95:
        return similar.cached_response
    response = await llm(query)
    await cache.store(query, embedding, response)
    return response
Be careful: similarity ≠ identity. A 0.95 cutoff is conservative; tune on real data. Don’t cache personalized or context-dependent responses.
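To see what the threshold actually does, here is a small in-memory sketch. It swaps pgvector for a pure-Python linear scan with cosine similarity, and takes embeddings as plain lists supplied by the caller. `SemanticCache` and the toy vectors are hypothetical and only there to illustrate the lookup logic.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache; a real deployment would use a vector index."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, embedding):
        """Return the best cached response at or above the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding, response):
        self.entries.append((list(embedding), response))


cache = SemanticCache(threshold=0.95)
cache.store([1.0, 0.0], "cached answer")
hit = cache.lookup([0.99, 0.05])   # nearly the same direction as the stored vector
miss = cache.lookup([0.0, 1.0])    # orthogonal query, falls through to the LLM
```

Logging the score alongside every hit makes it much easier to tune the cutoff on real traffic later.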
Pair with Build a RAG App with pgvector for the embedding infra.
5. Batching
Anthropic Message Batches and the OpenAI Batch API both run jobs asynchronously at a 50% discount, with a 24-hour completion window:
batch = client.messages.batches.create(requests=[
    {"custom_id": "doc-1", "params": {"model": "claude-sonnet-4-6", ...}},
    {"custom_id": "doc-2", "params": {...}},
    # ... more requests, up to the provider's current per-batch limit
])
# Poll later for results
Use cases that fit:
- Bulk summarization / extraction over a corpus.
- Embedding generation (OpenAI’s embedding API is also batchable).
- Eval runs in CI.
- Nightly content rewrites.
Half the cost. The trade-off is latency.
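A sketch of preparing a corpus for batch submission. It builds entries in the custom_id/params shape shown above and splits them into chunks, since per-batch limits differ by provider and change over time. The helper names, the prompt wording, and the chunk size are assumptions for illustration.

```python
def build_requests(docs, model, max_tokens=400):
    """Turn {doc_id: text} into batch request entries with unique custom_ids."""
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Summarize:\n\n{text}"}
                ],
            },
        }
        for doc_id, text in docs.items()
    ]


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy corpus; a tiny chunk size is used only to make the split visible.
docs = {f"doc-{i}": f"text of document {i}" for i in range(1, 6)}
batches = list(chunked(build_requests(docs, "claude-sonnet-4-6"), size=2))
print([len(b) for b in batches])
```

Each chunk would then go to its own `client.messages.batches.create(requests=chunk)` call, and the custom_id is what lets you match results back to documents.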
6. Smaller models, fine-tuned
If you have one task that runs millions of times, a LoRA fine-tune of a 7B–14B model can beat the prompted 70B equivalent on that task, at 10–30× lower cost per call.
Typical math: classification at $0.005/call (Sonnet) × 5M calls/month = $25k/month. Fine-tune Llama 3.1 8B (one-time $200) → $0.0003/call × 5M = $1.5k/month. That saves about $23.5k/month, so the fine-tune pays for itself in days.
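The same arithmetic as a reusable sketch, so you can plug in your own numbers. The per-call costs and the one-time fine-tuning cost are the figures from the paragraph above; the function names and the 30-day month are assumptions, and the per-call figure for the fine-tuned model is taken to include serving cost.

```python
def monthly_cost(cost_per_call, calls_per_month):
    """Monthly spend in USD for a given per-call cost."""
    return cost_per_call * calls_per_month


def payback_days(one_time_cost, monthly_savings, days_per_month=30):
    """Days until a one-time cost is recovered from monthly savings."""
    if monthly_savings <= 0:
        raise ValueError("no savings, so the cost is never recovered")
    return one_time_cost / (monthly_savings / days_per_month)


CALLS = 5_000_000
prompted = monthly_cost(0.005, CALLS)      # about $25,000
fine_tuned = monthly_cost(0.0003, CALLS)   # about $1,500
savings = prompted - fine_tuned            # about $23,500
days = payback_days(200, savings)

print(f"savings: ${savings:,.0f}/month, payback in {days:.2f} days")
```

The payback is so short here that the real cost to watch is engineering time for data preparation and evals, not the training run.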
See Fine-Tuning vs RAG vs Prompting in 2026 for the decision tree.
7. RAG instead of context stuffing
Instead of loading 50k tokens of “context” on every call, retrieve the relevant 2k:
| Approach | Tokens / call |
|---|---|
| Stuff full corpus | 50,000 |
| RAG over corpus | 2,000 |
25× fewer input tokens → 25× lower input cost. See Build a RAG App.
8. Streaming for perceived latency, not cost
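Streaming doesn’t reduce cost by itself. What it does is let you stop generation early: your code can close the stream once the user has what they need, and the user can cancel mid-response, which sometimes saves output tokens. (stop_sequences, covered in #9, cap generation on the server side and work with or without streaming.)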
async def chat_with_cancel(prompt, cancel_event):
    async with client.messages.stream(...) as stream:
        async for chunk in stream:
            if cancel_event.is_set():
                return
            yield chunk
For mechanics see SSE vs WebSockets in 2026.
9. Stop tokens
stop_sequences=["\n\nUser:", "\nQuestion:"]
If your prompt template is “User: … Assistant: …”, and the model sometimes hallucinates a follow-up turn, stop tokens cut it. Saves output tokens directly.
10. Distillation
Once you have a working LLM pipeline, log inputs and outputs. Use the logs to fine-tune a smaller model that mimics the big one.
This is distillation: the big model is the teacher, the small one is the student. Combined with #6, you can sometimes get away with a 1B model where you started with a 70B.
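A sketch of the first step: turning logged teacher calls into training records. It assumes each log entry is a dict with "input" and "output" strings, and that your trainer accepts chat-style "messages" records in JSONL; both are assumptions, so adapt the shape to whatever your fine-tuning tool expects. The sample log entries are made up.

```python
import json


def to_training_examples(logged_calls, min_output_chars=1):
    """Convert logged teacher calls into chat-style training records.

    Empty outputs and exact duplicate inputs are skipped.
    """
    seen_inputs = set()
    examples = []
    for call in logged_calls:
        prompt, completion = call["input"], call["output"]
        if len(completion.strip()) < min_output_chars or prompt in seen_inputs:
            continue
        seen_inputs.add(prompt)
        examples.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]
        })
    return examples


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)


logs = [
    {"input": "Classify: 'refund please'", "output": "billing"},
    {"input": "Classify: 'refund please'", "output": "billing"},  # duplicate
    {"input": "Classify: 'app crashes'", "output": ""},           # empty output
]
examples = to_training_examples(logs)
print(to_jsonl(examples))
```

Filtering matters more than volume here: the student learns the teacher's mistakes too, so it is worth dropping outputs that failed downstream validation before training.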
11. Eval-driven swaps
A model upgrade looks good in benchmarks. On your eval set, it might be 2% worse. Run evals on every change. Sometimes the cheaper model wins.
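A minimal sketch of gating a model swap on your own eval set. Models are passed in as callables so the example runs offline; exact-match accuracy and the `max_drop` tolerance are assumptions, and real evals usually need task-specific scoring. The toy eval set and stand-in models are made up.

```python
def accuracy(predict, eval_set):
    """Fraction of (input, expected) pairs where predict(input) == expected."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(1 for x, expected in eval_set if predict(x) == expected)
    return correct / len(eval_set)


def should_swap(candidate, baseline, eval_set, max_drop=0.0):
    """True if the candidate is at most `max_drop` worse than the baseline."""
    return accuracy(candidate, eval_set) >= accuracy(baseline, eval_set) - max_drop


# Toy stand-ins for model calls; real ones would hit the API.
eval_set = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("1+1", "2")]
answers = dict(eval_set)


def baseline(question):
    return answers[question]  # always right


def cheaper(question):
    return "?" if question == "5+5" else answers[question]  # misses one case


print(accuracy(cheaper, eval_set))
print(should_swap(cheaper, baseline, eval_set, max_drop=0.0))
```

Making the tolerance explicit forces the conversation about how much quality a given saving is worth, per feature.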
12. Cost dashboards
Track per-feature, per-customer, per-route cost. Without this, you’re flying blind. The simplest version:
CREATE TABLE llm_calls (
    id            BIGSERIAL PRIMARY KEY,
    ts            TIMESTAMPTZ DEFAULT now(),
    feature       TEXT,
    customer_id   BIGINT,
    model         TEXT,
    input_tokens  INT,
    output_tokens INT,
    cost_usd      NUMERIC(10, 6)
);
Rolled up daily, charted by feature. Now “feature X is suddenly $5k/day” is detectable.
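One way to fill the cost_usd column at write time, as a sketch. The prices and model keys below are placeholders, not real price-sheet values; Decimal is used so the stored value matches NUMERIC(10, 6) without float drift.

```python
from decimal import Decimal

# Assumed example prices in USD per million tokens (input, output).
# Replace with your provider's current price sheet; the keys are placeholders.
PRICES_PER_MTOK = {
    "small-model": (Decimal("1.00"), Decimal("5.00")),
    "large-model": (Decimal("3.00"), Decimal("15.00")),
}


def call_cost_usd(model, input_tokens, output_tokens):
    """Cost of one call, rounded to 6 decimals to match NUMERIC(10, 6)."""
    in_price, out_price = PRICES_PER_MTOK[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / Decimal(1_000_000)
    return cost.quantize(Decimal("0.000001"))


cost = call_cost_usd("large-model", input_tokens=6_000, output_tokens=400)
print(cost)
```

Storing the computed cost per row, rather than recomputing from tokens later, keeps historical numbers stable when prices change.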
A real-world stack
For a SaaS in 2026 doing 1M LLM calls/day:
- Prompt caching for system prompts and tool definitions.
- Haiku for 70% of traffic; Sonnet for 25%; Opus for 5%.
- 200ms max_tokens=300 default; bumped for specific tasks.
- pgvector semantic cache for FAQ-shaped questions.
- Daily batch summarization runs.
- Per-feature cost dashboard.
- Quarterly review: which features are eating budget? Which can move to a smaller model?
Without these tactics: $50k/month. With them: $5–10k. Same product.
What I’d do day one
If you have an LLM app and haven’t tried any of this:
- Add prompt caching: hours of work.
- Set max_tokens honestly: minutes.
- Add a per-feature cost log: half a day.
- Look at the top feature; can it use a smaller model?
Three tactics in a week, often 60% off the bill.
Read this next
- Anthropic Claude API + Tool Use Guide
- Fine-Tuning vs RAG vs Prompting
- AI Gateways in 2026
- LLM Evaluations
If you want my LLM cost-optimization checklist + cost dashboard SQL, it’s at rajpoot.dev.