What's the biggest cost lever for LLM apps?

Prompt caching. A 90% discount on cached input tokens turns large stable system prompts and reference contexts from a steady drag into a rounding error. Almost every production app should be using it; almost no early-stage app is.

Should I fine-tune a smaller model to cut costs?

Yes, if you have one or two narrow high-volume tasks. A LoRA on a 7B-14B base often matches a 70B on the narrow task at 5–20× lower cost. Run the math on token volumes; see the Fine-Tuning vs RAG vs Prompting post.

Is batching worth it?

For non-interactive workloads, yes. Anthropic and OpenAI both ship batch APIs at 50% discount for jobs that can wait up to 24 hours. Embedding pipelines, eval runs, and offline summarization all qualify.

LLM Cost Optimization in 2026 — Tactics That Cut Bills 50–90%

LLM bills compound fast. A team that started at $500/month can be at $50k/month by month 9. The good news: most of that growth is waste. With the right tactics you can cut the bill 50–90% without giving up anything users notice. This post is the working playbook.

The cost shape

Three factors multiply: how many calls you make, how many tokens each call carries, and what each token costs:

total_cost ≈ calls × ((input_tokens × in_price) + (output_tokens × out_price))

Each lever attacks a different part:

Lever	Attacks	Typical savings
Prompt caching	input_tokens × in_price	60–90% on repeated prefixes
Model routing	in_price + out_price	30–80% via cheaper models
Output bounds	output_tokens	20–50%
Semantic caching	repeated calls	30–70% on FAQ-shape traffic
Batching	both prices × 0.5	50% for batch-able work
Fine-tuning	both prices	80–95% on narrow tasks

You want to apply as many as fit. The savings compound.
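To make "the savings compound" concrete, here is a minimal sketch. The function name and the example percentages are illustrative assumptions, and it treats each lever as a fractional cut applied to whatever the previous levers left behind, which real levers only approximate because each attacks a different term of the bill.

```python
from functools import reduce


def combined_savings(savings: list[float]) -> float:
    """Fraction of the bill removed when each lever removes a fraction
    of what the previous levers left behind."""
    remaining = reduce(lambda left, s: left * (1.0 - s), savings, 1.0)
    return 1.0 - remaining


# Illustrative numbers only: 60% from caching, 30% from routing, 20% from output caps.
print(f"{combined_savings([0.60, 0.30, 0.20]):.1%}")
```

The point of multiplying rather than adding: three levers at 60%, 30%, and 20% do not sum to 110% off. Each one only shrinks the part of the bill that is still there.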

1. Prompt caching (do this first)

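Anthropic and OpenAI both support prompt caching. With Anthropic you mark the end of a stable prefix with cache_control; later requests that reuse that prefix get the cached tokens billed at ~10% of input price (the first request that writes the cache pays slightly more than normal input). OpenAI caches long prefixes automatically, with a discount that depends on the model.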

client.messages.create(
    model="claude-sonnet-4-6",
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,           # 5k tokens
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": question}],
)

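For a chatbot with a 5k-token system prompt + 1k tokens of conversation history × 1000 conversations/day (one request each), caching the system prompt is the difference between $7.50 and $0.75 per day for that prefix. Those figures imply roughly $1.50 per million input tokens, so scale them to your model's price. The 1k of history sits after the cache marker and still pays full price.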
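The arithmetic behind that example can be sketched as below. The function and parameter names are mine, and the $1.50 per million input tokens is a hypothetical price picked only because it reproduces the $7.50 figure above; substitute your model's real rate. The sketch also assumes every request is a cache hit and ignores the cache-write surcharge.

```python
def daily_input_cost(prefix_tokens: int, dynamic_tokens: int, requests_per_day: int,
                     price_per_mtok: float, cache_read_multiplier: float = 1.0) -> float:
    """Daily input spend in USD.

    The stable prefix is billed at price * cache_read_multiplier; the dynamic
    tail always pays full price. Cache writes and misses are ignored.
    """
    prefix_cost = prefix_tokens * price_per_mtok * cache_read_multiplier
    dynamic_cost = dynamic_tokens * price_per_mtok
    return requests_per_day * (prefix_cost + dynamic_cost) / 1_000_000


PRICE = 1.50  # hypothetical USD per million input tokens
uncached = daily_input_cost(5_000, 1_000, 1_000, PRICE)
cached = daily_input_cost(5_000, 1_000, 1_000, PRICE, cache_read_multiplier=0.10)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
```

Note that the totals include the 1k dynamic tokens, which cost the same either way. The longer your stable prefix is relative to the dynamic tail, the closer the total saving gets to the full 90%.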

You get up to 4 cache markers per request. Place them at: end of tool definitions, end of system prompt, end of static reference docs, end of fixed conversation prefix. Everything after the last marker is dynamic and pays full price.

I covered the mechanics in Anthropic Claude API + Tool Use Guide.

2. Model routing

Most apps use one model for everything. Most apps would do fine using a small model for 80% of traffic and a big model only for the hard 20%.

def pick_model(task_type: str, input_length: int) -> str:
    if task_type == "classification":     return "claude-haiku-4-5"
    if task_type == "extraction":         return "claude-haiku-4-5"
    if task_type == "summarization" and input_length < 5000: return "claude-haiku-4-5"
    if task_type in {"reasoning", "code_review"}: return "claude-opus-4-7"
    return "claude-sonnet-4-6"

Smarter version: a fast classifier (Haiku) routes to the right model for each request. The router cost is negligible; the savings are huge.
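A minimal sketch of that classifier-based router follows. The difficulty labels, the `route` function, and the `keyword_classifier` stand-in are all hypothetical; in a real app the `classify` callable would wrap a Haiku call that returns one of the labels. The model names are the ones already used in pick_model.

```python
from typing import Callable

# Labels are hypothetical; model names match the pick_model example.
ROUTES = {
    "simple": "claude-haiku-4-5",
    "standard": "claude-sonnet-4-6",
    "hard": "claude-opus-4-7",
}
DEFAULT_MODEL = "claude-sonnet-4-6"


def route(prompt: str, classify: Callable[[str], str]) -> str:
    """Ask a cheap classifier for a difficulty label, then map it to a model.

    Unknown or malformed labels fall back to the mid-tier default.
    """
    label = classify(prompt).strip().lower()
    return ROUTES.get(label, DEFAULT_MODEL)


def keyword_classifier(prompt: str) -> str:
    """Offline stand-in for the small-model call so the sketch runs as-is."""
    lowered = prompt.lower()
    if "prove" in lowered or "refactor" in lowered:
        return "hard"
    return "simple" if len(prompt) < 200 else "standard"


print(route("Is this email spam?", keyword_classifier))
```

Falling back to the mid-tier model on an unexpected label is a deliberate choice here: a confused router should cost you a little money, not a bad answer.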

For more on routing infrastructure see AI Gateways.

3. Output bounds

client.messages.create(
    ...,
    max_tokens=400,             # not 4096
)

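The max_tokens you copied from an example is generous. Anthropic's API makes you set it on every call, and a pasted 4096 tends to stick. Most production responses are 100–500 tokens. Setting max_tokens to the realistic worst case stops a model that decides to write an essay. It is a hard cutoff, not a length hint, so watch for stop_reason == "max_tokens" to catch answers that got truncated.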

For structured output, this is even more important — a tool call rarely needs more than 200 tokens. Set the cap.
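One way to find the "realistic worst case" is to take a high percentile of the output lengths you have already logged and add some headroom. This is a sketch under my own assumptions: the function name, the 99th percentile, and the 20% headroom are illustrative defaults, not recommendations from any provider.

```python
import math


def suggest_max_tokens(observed_output_tokens: list[int],
                       percentile: float = 0.99,
                       headroom: float = 1.2) -> int:
    """Suggest a max_tokens cap from logged output token counts.

    Uses the nearest-rank percentile (the smallest observed value with at
    least `percentile` of samples at or below it), then adds headroom.
    """
    if not observed_output_tokens:
        raise ValueError("need at least one observation")
    ordered = sorted(observed_output_tokens)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)


# Example with made-up log data: mostly short answers, one runaway essay.
logged = [120, 180, 210, 250, 300, 340, 410, 3900]
print(suggest_max_tokens(logged, percentile=0.75, headroom=1.2))
```

Run it per feature rather than globally. A classifier and a summarizer have very different worst cases, and one shared cap ends up too loose for one and too tight for the other.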

4. Semantic caching

For FAQ-shaped traffic (“what’s our refund policy”, “how do I reset my password”), embed the query and look up similar past queries:

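SIMILARITY_CUTOFF = 0.95  # conservative starting point; tune on real traffic

async def answer(query: str) -> str:
    # embed, pgvector_search, llm, and cache are app-specific helpers (not shown).
    embedding = await embed(query)
    # Closest previously answered query at or above the cutoff, or None.
    similar = await pgvector_search(embedding, threshold=SIMILARITY_CUTOFF)
    if similar and similar.score > SIMILARITY_CUTOFF:
        return similar.cached_response  # cache hit: no LLM call
    response = await llm(query)
    await cache.store(query, embedding, response)  # save for future lookups
    return response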

Be careful: similarity ≠ identity. A 0.95 cutoff is conservative; tune on real data. Don’t cache personalized or context-dependent responses.
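To experiment with the cutoff before wiring up pgvector, an in-memory version is enough. This is a sketch, not the pgvector setup from the snippet above: `SemanticCache` and `cosine` are hypothetical names, the lookup is a linear scan, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    threshold: float = 0.95
    entries: list[tuple[list[float], str]] = field(default_factory=list)

    def lookup(self, embedding: list[float]) -> Optional[str]:
        """Return the response of the most similar stored query, if it clears the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding: list[float], response: str) -> None:
        self.entries.append((list(embedding), response))


cache = SemanticCache()
cache.store([1.0, 0.0], "Refunds are available within 30 days.")  # made-up answer
print(cache.lookup([0.99, 0.05]))  # near-duplicate query: hit
print(cache.lookup([0.7, 0.7]))    # related but different query: miss
```

Replaying a day of real queries through something like this, at a few different thresholds, shows how many hits you would get and lets you eyeball whether the hits are actually the same question.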

Pair with Build a RAG App with pgvector for the embedding infra.

5. Batching

Anthropic Message Batches and OpenAI Batch API both run jobs at a 50% discount, with results due within 24 hours:

batch = client.messages.batches.create(requests=[
    {"custom_id": "doc-1", "params": {"model": "claude-sonnet-4-6", ...}},
    {"custom_id": "doc-2", "params": {...}},
    # ... more requests; Anthropic documents a cap of 100,000 requests or 256 MB per batch
])
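# Poll client.messages.batches.retrieve(batch.id) until processing_status == "ended",
# then read client.messages.batches.results(batch.id)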

Use cases that fit:

Bulk summarization / extraction over a corpus.
Embedding generation (OpenAI’s embedding API is also batchable).
Eval runs in CI.
Nightly content rewrites.

Half the cost. The trade-off is latency.
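Building the requests list for a corpus is mostly bookkeeping. The sketch below assumes documents keyed by ID, a hypothetical "Summarize" prompt, and a placeholder chunk size; each chunk would then be passed as requests= to the batch-create call shown above.

```python
from typing import Iterator


def build_batch_requests(docs: dict[str, str], model: str,
                         max_tokens: int = 300) -> list[dict]:
    """One batch entry per document, keyed by custom_id so results can be matched back."""
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Summarize this document:\n\n{text}"}
                ],
            },
        }
        for doc_id, text in docs.items()
    ]


def chunked(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Split requests into batches no larger than `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


docs = {f"doc-{i}": f"Body of document {i}" for i in range(1, 6)}  # toy corpus
requests = build_batch_requests(docs, model="claude-sonnet-4-6")
for batch_requests in chunked(requests, size=2):  # size=2 only to show the split
    print([r["custom_id"] for r in batch_requests])
```

Results do not necessarily come back in submission order, so custom_id is how you join them to your documents. Make it something you can look up directly.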

6. Smaller models, fine-tuned

If you have one task that runs millions of times, a LoRA fine-tune of a 7B-14B model can match the 70B prompted equivalent at 5–20× lower cost per call.

Typical math: classification at $0.005/call (Sonnet) × 5M calls/month = $25k. Fine-tune Llama 3.1 8B (one-time $200) → $0.0003/call × 5M = $1.5k/month. Saves $23.5k/month, payback in days.
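The same math as a reusable sketch, so you can plug in your own volumes. The function name and the 30-day month are my assumptions, and it deliberately leaves out hosting overhead, engineering time, and eval work, which in practice dominate the $200 training run.

```python
def fine_tune_payback(big_cost_per_call: float, small_cost_per_call: float,
                      calls_per_month: int, one_time_cost: float) -> dict:
    """Monthly spend for both options, the monthly saving, and days to recoup training."""
    big = big_cost_per_call * calls_per_month
    small = small_cost_per_call * calls_per_month
    savings = big - small
    # Assume a 30-day month; no payback if the small model is not cheaper.
    payback_days = one_time_cost / (savings / 30) if savings > 0 else float("inf")
    return {"big": big, "small": small, "savings": savings, "payback_days": payback_days}


result = fine_tune_payback(0.005, 0.0003, 5_000_000, one_time_cost=200)
print(f"big ${result['big']:,.0f}/mo, small ${result['small']:,.0f}/mo, "
      f"saves ${result['savings']:,.0f}/mo")
```

At lower volumes the picture changes quickly. Try 50k calls/month with the same per-call prices and the saving is a couple of hundred dollars, which will not cover the time spent maintaining a fine-tuned model.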

See Fine-Tuning vs RAG vs Prompting in 2026 for the decision tree.

7. RAG instead of context stuffing

Instead of loading 50k tokens of “context” on every call, retrieve the relevant 2k:

Approach	Tokens / call
Stuff full corpus	50,000
RAG over corpus	2,000

25× fewer input tokens → 25× lower input cost. See Build a RAG App.
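The part of RAG that controls cost is the token budget on what you retrieve. A minimal sketch, assuming the retriever has already produced (score, text) pairs: `select_chunks` is a hypothetical helper, and the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```python
def select_chunks(scored_chunks: list[tuple[float, str]],
                  token_budget: int = 2_000) -> list[str]:
    """Take chunks in descending score order, skipping any that would exceed the budget."""
    chosen: list[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = max(1, len(text) // 4)  # rough chars-to-tokens estimate (assumption)
        if used + cost > token_budget:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        chosen.append(text)
        used += cost
    return chosen


# Toy data: three ~1,000-token chunks with made-up relevance scores.
chunks = [(0.91, "a" * 4_000), (0.84, "b" * 4_000), (0.62, "c" * 4_000)]
print(len(select_chunks(chunks, token_budget=2_000)))
```

A hard budget keeps input cost per call predictable no matter how large the corpus grows, which is the property context stuffing does not have.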

8. Streaming for perceived latency, not cost

Streaming doesn’t reduce cost by itself. What it gives you is the chance to stop early: your code or the user can cancel once they have what they need, which usually saves the output tokens that were never generated. (stop_sequences, covered in #9, work with or without streaming.)

async def chat_with_cancel(prompt, cancel_event):
    async with client.messages.stream(...) as stream:
        async for chunk in stream:
            if cancel_event.is_set():
                return
            yield chunk

For mechanics see SSE vs WebSockets in 2026.

9. Stop tokens

stop_sequences=["\n\nUser:", "\nQuestion:"]

If your prompt template is “User: … Assistant: …” and the model sometimes hallucinates a follow-up turn, stop sequences cut it off before it starts. That saves output tokens directly.

10. Distillation

Once you have a working LLM pipeline, log inputs and outputs. Use the logs to fine-tune a smaller model that mimics the big one.

This is distillation: the big model is the teacher, the small one is the student. Combined with #6, you can sometimes get away with a 1B model where you started with a 70B.
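The first practical step is turning logs into a training file. The sketch below assumes log records with hypothetical "input" and "output" fields and writes chat-style JSONL; many fine-tuning tools accept a layout like this, but check what your trainer expects.

```python
import json


def logs_to_jsonl(records: list[dict], min_output_chars: int = 1) -> str:
    """Convert logged teacher calls into one JSON training example per line.

    Skips empty prompts, too-short outputs, and repeated prompts (first one wins).
    """
    lines = []
    seen_prompts = set()
    for record in records:
        prompt = record.get("input", "").strip()
        completion = record.get("output", "").strip()
        if not prompt or len(completion) < min_output_chars or prompt in seen_prompts:
            continue
        seen_prompts.add(prompt)
        lines.append(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]
        }))
    return "\n".join(lines)


# Made-up log rows for illustration.
logs = [
    {"input": "Classify: 'refund please'", "output": "billing"},
    {"input": "Classify: 'refund please'", "output": "billing"},  # duplicate
    {"input": "Classify: 'app crashes'", "output": ""},            # empty output
]
print(logs_to_jsonl(logs))
```

Filtering matters more than volume here. The student copies whatever the teacher did, including its mistakes, so dropping outputs that failed your evals or got negative user feedback is worth doing before training.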

11. Eval-driven swaps

A model upgrade looks good in benchmarks. On your eval set, it might be 2% worse. Run evals on every change. Sometimes the cheaper model wins.
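Once every candidate has a score on your own eval set, the swap decision is mechanical. A sketch with a hypothetical helper and made-up numbers: pick the cheapest model that clears the accuracy bar you set.

```python
from typing import Optional


def cheapest_passing(results: dict[str, tuple[float, float]],
                     min_accuracy: float) -> Optional[str]:
    """results maps model name -> (accuracy on your eval set, cost per 1k calls in USD).

    Returns the cheapest model meeting min_accuracy, or None if none does.
    """
    passing = [(cost, name) for name, (accuracy, cost) in results.items()
               if accuracy >= min_accuracy]
    return min(passing)[1] if passing else None


# Illustrative numbers only, not measurements.
eval_results = {
    "claude-haiku-4-5": (0.93, 1.10),
    "claude-sonnet-4-6": (0.95, 4.80),
    "claude-opus-4-7": (0.94, 21.00),
}
print(cheapest_passing(eval_results, min_accuracy=0.92))
print(cheapest_passing(eval_results, min_accuracy=0.95))
```

Raising the bar from 0.92 to 0.95 changes the answer, which is the whole point: the threshold is a product decision, and the eval set is what lets you make it with numbers.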

12. Cost dashboards

Track per-feature, per-customer, per-route cost. Without this, you’re flying blind. The simplest version:

CREATE TABLE llm_calls (
  id BIGSERIAL PRIMARY KEY,
  ts TIMESTAMPTZ DEFAULT now(),
  feature TEXT,
  customer_id BIGINT,
  model TEXT,
  input_tokens INT,
  output_tokens INT,
  cost_usd NUMERIC(10, 6)
);

Rolled up daily, charted by feature. Now “feature X is suddenly $5k/day” is detectable.
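Filling in cost_usd and rolling it up takes a price table and a few lines. In this sketch the model names and per-million-token prices are placeholders, not real rates, and the records are plain dicts with the same fields as the table above.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical USD prices per million tokens: (input, output). Replace with real rates.
PRICES = {
    "small-model": (Decimal("1"), Decimal("5")),
    "large-model": (Decimal("3"), Decimal("15")),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one call in USD, using Decimal to avoid float drift in sums."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / Decimal(1_000_000)


def cost_by_feature(calls: list[dict]) -> dict[str, Decimal]:
    """Total cost per feature across a list of call records."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for call in calls:
        totals[call["feature"]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)


# Made-up call log.
calls = [
    {"feature": "search", "model": "small-model", "input_tokens": 2_000, "output_tokens": 100},
    {"feature": "search", "model": "small-model", "input_tokens": 1_500, "output_tokens": 80},
    {"feature": "report", "model": "large-model", "input_tokens": 1_000, "output_tokens": 200},
]
for feature, total in cost_by_feature(calls).items():
    print(feature, total)
```

Computing cost at write time, with the prices in effect that day, keeps historical rows correct after a provider changes its rates.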

A real-world stack

For a SaaS in 2026 doing 1M LLM calls/day:

Prompt caching for system prompts and tool definitions.
Haiku for 70% of traffic; Sonnet for 25%; Opus for 5%.
max_tokens=300 default; bumped for specific tasks.
pgvector semantic cache for FAQ-shaped questions.
Daily batch summarization runs.
Per-feature cost dashboard.
Quarterly review: which features are eating budget? Which can move to a smaller model?

Without these tactics: $50k/month. With them: $5–10k. Same product.

What I’d do day one

If you have an LLM app and haven’t tried any of this:

Add prompt caching — hours of work.
Set max_tokens honestly — minutes.
Add a per-feature cost log — half a day.
Look at the top feature; can it use a smaller model?

Three tactics in a week, often 60% off the bill.

Read this next

If you want my LLM cost-optimization checklist + cost dashboard SQL, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
