The cheapest LLM cost optimization in 2026: stop running every query on Sonnet or Opus. Most queries are easy, and Haiku handles them at 5–10% of the price. This post walks through the routing pattern.
The realization
A typical LLM app’s query mix:
- 70% are easy. “Summarize this 200-token block.” “Classify into 5 buckets.” “Extract this field.”
- 25% are medium. “Write a polite reply to this complaint.” “Generate 3 variations of this headline.”
- 5% are hard. “Reason through this multi-step argument.” “Refactor this code while preserving semantics.”
If you run everything on Opus 4.7, you pay Opus prices on every query. Route the 70% to Haiku, the 25% to Sonnet, and only the 5% to Opus, and the total bill drops 75–85%.
A simple router
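from anthropic import AsyncAnthropic


class LLMRouter:
    def __init__(self, client: AsyncAnthropic | None = None):
        self.client = client or AsyncAnthropic()

    async def route(self, query: str, context: dict) -> str:
        # self.haiku / self.sonnet / self.opus wrap messages.create for
        # each model tier; they are omitted here.
        difficulty = await self.classify(query)
        if difficulty == "easy":
            return await self.haiku(query, context)
        if difficulty == "medium":
            return await self.sonnet(query, context)
        return await self.opus(query, context)

    async def classify(self, query: str) -> str:
        # Fast Haiku classification; forcing the tool call yields structured output.
        resp = await self.client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=20,
            tools=[{
                "name": "rate_difficulty",
                "description": "Rate the difficulty of the user's query.",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "level": {"type": "string", "enum": ["easy", "medium", "hard"]},
                    },
                    "required": ["level"],
                },
            }],
            tool_choice={"type": "tool", "name": "rate_difficulty"},
            system=ROUTER_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": query}],
        )
        # Read the tool_use block rather than assuming it is first.
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        return tool_use.input["level"]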
The classifier is itself an LLM call, but at Haiku’s $1/MTok input price it is a rounding error next to Opus’s $15/MTok.
ROUTER_SYSTEM_PROMPT
You rate the difficulty of LLM queries to route them appropriately.
EASY: short factual questions, single-document Q&A, classification, extraction.
MEDIUM: multi-paragraph generation, polite-reply generation, simple reasoning.
HARD: multi-step reasoning, code refactor, math, architectural decisions.
When uncertain, prefer MEDIUM over EASY (errs toward higher quality).
The “uncertain → medium” rule is important. False negatives (a hard query routed to a cheap model) degrade output quality, and you would rather overpay sometimes than under-deliver.
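The same bias can be enforced in code for the case where the classifier returns something unexpected. This is a minimal sketch, not part of the router above: `normalize_level` and `VALID_LEVELS` are hypothetical names introduced here for illustration.

```python
VALID_LEVELS = ("easy", "medium", "hard")


def normalize_level(raw: object, default: str = "medium") -> str:
    """Map raw classifier output to a known level.

    Anything unrecognized (wrong type, typo, empty string) falls back
    to `default`, mirroring the "when uncertain, prefer MEDIUM" rule.
    """
    if isinstance(raw, str):
        level = raw.strip().lower()
        if level in VALID_LEVELS:
            return level
    return default
```

Wrapping the classifier's return value in a helper like this means a malformed response costs a Sonnet call instead of raising or silently landing on the cheapest model.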
Better than a classifier: a feature-based router
Skip the LLM classifier. Use cheap features:
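def route_by_features(query: str, context: dict) -> str:
    # is_classification_shape and needs_multistep_reasoning are cheap,
    # app-specific heuristics (regexes, keyword lists); not shown here.
    if len(query) < 200 and is_classification_shape(query):
        return "haiku"
    if needs_multistep_reasoning(query):
        return "opus"
    if context.get("user_tier") == "free":
        return "haiku"
    return "sonnet"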
Faster (no LLM call), simpler, deterministic. Combined with a fallback (“if Haiku output looks bad, retry with Sonnet”), it gives you a self-correcting system.
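One way the fallback could look, as a sketch under stated assumptions: `looks_bad` is a hypothetical and deliberately crude quality check, `call_model` stands in for whatever function actually calls each tier, and the names in `LADDER` are labels rather than API model IDs.

```python
from typing import Callable

# Escalation order; these are tier labels, not API model IDs.
LADDER = ["haiku", "sonnet", "opus"]


def looks_bad(output: str) -> bool:
    """Hypothetical cheap check: empty, very short, or an obvious refusal."""
    text = output.strip().lower()
    return len(text) < 10 or text.startswith(("i can't", "i cannot", "i'm not sure"))


def answer_with_fallback(
    query: str,
    start: str,
    call_model: Callable[[str, str], str],
    max_escalations: int = 1,
) -> tuple[str, str]:
    """Call the routed tier, moving up one rung each time the output looks bad.

    Returns (tier_used, output). If every attempt looks bad, the output
    of the highest tier tried is returned as-is.
    """
    idx = LADDER.index(start)
    last = min(idx + max_escalations, len(LADDER) - 1)
    output = ""
    for i in range(idx, last + 1):
        output = call_model(LADDER[i], query)
        if not looks_bad(output):
            return LADDER[i], output
    return LADDER[last], output
```

Capping `max_escalations` keeps the worst case bounded: with the default of 1, a bad Haiku answer costs one extra Sonnet call and never cascades to Opus.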
Cost math
Same workload, 1M queries / month:
- All on Opus: ~$15k/month.
- All on Sonnet: ~$5k/month.
- Routed (70% Haiku, 25% Sonnet, 5% Opus): ~$2.7k/month, assuming all-on-Haiku would cost ~$1k.
That’s roughly an 80% reduction, in line with the 75–85% estimate above. Engineering investment: a day to wire it in.
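The blended figure is a weighted sum of the per-tier bills, which is easy to check. In this sketch the Opus and Sonnet numbers come from the list above; the all-on-Haiku figure of $1k is an assumption, taken from Haiku costing roughly 1/15 of Opus per input token.

```python
# Monthly cost in USD if every query ran on a single tier.
# Opus and Sonnet are the figures from the list; Haiku is an assumption.
ALL_ON = {"haiku": 1_000, "sonnet": 5_000, "opus": 15_000}
MIX = {"haiku": 0.70, "sonnet": 0.25, "opus": 0.05}


def blended_cost(all_on: dict[str, float], mix: dict[str, float]) -> float:
    """Weighted sum of per-tier costs by share of traffic."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(all_on[tier] * share for tier, share in mix.items())


routed = blended_cost(ALL_ON, MIX)
saving = 1 - routed / ALL_ON["opus"]
print(f"routed: ${routed:,.0f}/month, saving vs all-Opus: {saving:.0%}")
# prints: routed: $2,700/month, saving vs all-Opus: 82%
```

Note that the 5% Opus slice and the 25% Sonnet slice together account for about $2k of the routed bill, so the saving is bounded near 87% even if the Haiku tier were free. Classifier calls and fallback retries add a little on top.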
Quality math
The risk is “Haiku gets it wrong on a hard query.” Mitigations:
- Eval set that scores Haiku vs Sonnet vs Opus on representative queries. See LLM Evaluations.
- Confidence scoring: if Haiku’s structured output includes a “confidence” field and it falls below a threshold, escalate.
- A/B test in production: 5% of “easy” routes go to Sonnet too; compare quality.
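The last two mitigations can be sketched in a few lines. This assumes Haiku's structured output is a dict with an optional numeric `confidence` field and that each query has a stable ID to hash; the threshold of 0.7 and both function names are illustrative, not taken from a real system.

```python
import hashlib

SHADOW_RATE = 0.05          # share of "easy" traffic mirrored to Sonnet
CONFIDENCE_THRESHOLD = 0.7  # assumed cut-off; calibrate on your eval set


def in_shadow_sample(query_id: str, rate: float = SHADOW_RATE) -> bool:
    """Deterministically select about `rate` of queries by hashing a stable ID."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def should_escalate(haiku_result: dict, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Escalate when the confidence field is missing, non-numeric, or too low."""
    confidence = haiku_result.get("confidence")
    if not isinstance(confidence, (int, float)):
        return True
    return confidence < threshold
```

Hashing the ID instead of calling a random number generator makes the sample reproducible: the same query always lands in or out of the shadow set, which simplifies comparing Haiku and Sonnet outputs after the fact.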
The router pattern in production
                   Query
                     ↓
   Router (feature- or classifier-based)
                     ↓
   ┌────────────┬──────────┬────────────┐
 Haiku       Sonnet       Opus  vLLM-hosted Llama
(cheap)     (default)    (hard)   (volume work)
For volume work (extraction, classification), self-host Llama 3.1 8B fine-tuned on your data. See Self-Hosted LLMs.
Read this next
- LLM Cost Optimization in 2026
- Fine-Tuning vs RAG vs Prompting in 2026
- LLM Evaluations
- AI Gateways in 2026
If you want a working router with eval harness and shadow A/B, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.