How accurate does the classifier need to be?

85%+ on routing decisions is plenty for most apps. False-positive (sent easy to expensive) is fine; false-negative (sent hard to cheap) degrades quality. Tune the classifier to err on the safe side.

Doesn't the classifier add latency?

Haiku-class classification is ~200ms. On a Sonnet/Opus call that's 1–3 seconds, the routing call is rounding error and saves 5–10× the cost when easy routes hit.

LLM Routing in 2026 — Use Haiku to Save 80% on Sonnet/Opus Bills

The cheapest LLM cost optimization in 2026: stop running every query on Sonnet or Opus. Most queries are easy and Haiku handles them at 5–10% the price. This post is the routing pattern.

The realization

A typical LLM app’s query mix:

70% are easy. “Summarize this 200-token block.” “Classify into 5 buckets.” “Extract this field.”
25% are medium. “Write a polite reply to this complaint.” “Generate 3 variations of this headline.”
5% are hard. “Reason through this multi-step argument.” “Refactor this code while preserving semantics.”

If you run all 100% on Opus 4.7, you pay Opus prices. Route the 70% to Haiku, the 25% to Sonnet, and only the 5% to Opus — total bill drops 75–85%.

A simple router

class LLMRouter:
    async def route(self, query: str, context: dict) -> str:
        difficulty = await self.classify(query)
        if difficulty == "easy":
            return await self.haiku(query, context)
        if difficulty == "medium":
            return await self.sonnet(query, context)
        return await self.opus(query, context)

    async def classify(self, query: str) -> str:
        # Fast Haiku classification
        resp = await self.client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=20,
            tools=[{
                "name": "rate_difficulty",
                "input_schema": {
                    "type": "object",
                    "properties": {"level": {"enum": ["easy", "medium", "hard"]}},
                    "required": ["level"],
                },
            }],
            tool_choice={"type": "tool", "name": "rate_difficulty"},
            system=ROUTER_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": query}],
        )
        return resp.content[0].input["level"]

The classifier is itself an LLM call — but Haiku’s $1/MTok input is rounding error compared to Opus’s $15/MTok.

ROUTER_SYSTEM_PROMPT

You rate the difficulty of LLM queries to route them appropriately.

EASY: short factual questions, single-document Q&A, classification, extraction.
MEDIUM: multi-paragraph generation, polite-reply generation, simple reasoning.
HARD: multi-step reasoning, code refactor, math, architectural decisions.

When uncertain, prefer MEDIUM over EASY (errs toward higher quality).

The “uncertain → medium” rule is important. False-negatives (hard query routed to easy model) degrade output quality; you’d rather over-pay sometimes than under-deliver.

Better than classifier: feature-based router

Skip the LLM classifier. Use cheap features:

def route_by_features(query: str, context: dict) -> str:
    if len(query) < 200 and is_classification_shape(query):
        return "haiku"
    if needs_multistep_reasoning(query):
        return "opus"
    if context.get("user_tier") == "free":
        return "haiku"
    return "sonnet"

Faster (no LLM call), simpler, deterministic. Combined with a fallback “if Haiku output looks bad, retry with Sonnet” gives you a self-correcting system.

Cost math

Same workload, 1M queries / month:

All on Opus: ~$15k/month.
All on Sonnet: ~$5k/month.
Routed (70% Haiku, 25% Sonnet, 5% Opus): ~$1.5k/month.

That’s a 90% reduction. Engineering investment: a day to wire it in.

Quality math

The risk is “Haiku gets it wrong on a hard query.” Mitigations:

Eval set that scores Haiku vs Sonnet vs Opus on representative queries. See LLM Evaluations .
Confidence scoring — if Haiku’s output has structured “confidence” field below threshold, escalate.
A/B test in production: 5% of “easy” routes go to Sonnet too; compare quality.

The router pattern in production

Query
  ↓
  Router (feature- or classifier-based)
  ↓
  ┌────────────┬──────────┬────────────┐
  Haiku        Sonnet      Opus         vLLM-hosted Llama
  (cheap)      (default)   (hard)       (volume work)

For volume work (extraction, classification), self-host Llama 3.3 8B fine-tuned on your data. See Self-Hosted LLMs .

Read this next

If you want a working router with eval harness and shadow A/B, it’s at rajpoot.dev .

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .

The realization#

A simple router#

ROUTER_SYSTEM_PROMPT#

Better than classifier: feature-based router#

Cost math#

Quality math#

The router pattern in production#

Read this next#