LLM cost optimization.
Where money goes
- Input tokens (cheap-ish).
- Output tokens (typically 3-5x the input price per token).
- Context window (long inputs add up, and chat history is re-sent on every turn).
- Embeddings (usually cheap).
Right-sizing the model
| Task | Model |
|---|---|
| Quick classification | Haiku, gpt-4o-mini |
| RAG synthesis | Sonnet, gpt-4o |
| Complex reasoning | Opus, o-series |
| Embeddings | small/medium dim |
Default to the smallest model that does the job; escalate only when quality falls short.
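The "default to small, escalate" rule can be sketched as a simple cascade. This is a minimal sketch, not a definitive implementation: `call_model`, `is_good_enough`, and the tier names are hypothetical placeholders you would wire to your own provider client and quality check.

```python
from typing import Callable

# Hypothetical tier names, cheapest first; substitute real model IDs.
TIERS = ["small", "medium", "large"]

def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the cheapest tier first and escalate only when the check fails.

    Returns (tier_used, answer). The last tier's answer is returned even
    if it fails the check, since there is nothing left to escalate to.
    """
    answer = ""
    for tier in TIERS:
        answer = call_model(tier, prompt)
        if is_good_enough(answer):
            return tier, answer
    return TIERS[-1], answer
```

The quality check is the hard part in practice; a cheap option is a format or confidence check, so that most traffic never leaves the small tier.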
Prompt caching (Anthropic)
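import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    system=[
        # Mark the long, stable prefix as cacheable
        {"type": "text", "text": LONG_INSTRUCTIONS, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": q}],
)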
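The default cache TTL is 5 minutes, refreshed on each hit. Cache hits are billed at ~10% of the normal input price; the first call that writes the cache costs slightly more than a normal input. Huge savings when the same long context is reused across requests.
OpenAI also offers cached input pricing, applied automatically to repeated prompt prefixes (gpt-5 etc.), with no code changes.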
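To see what "~10% on hits" means in practice, here is a rough cost calculation for the shared prefix alone. It is a sketch under stated assumptions: the 1.25x write premium and 0.10x read rate are defaults you should check against current pricing, and the price in the usage note is illustrative, not a quoted rate.

```python
def cached_prefix_cost(
    prefix_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed premium for the call that writes the cache
    read_multiplier: float = 0.10,   # ~10% of the input price on a cache hit
) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) for the shared prefix only.

    Assumes the first call writes the cache and every later call hits it,
    i.e. calls arrive within the cache TTL.
    """
    if calls <= 0:
        return 0.0, 0.0
    base = prefix_tokens / 1_000_000 * price_per_mtok
    without = base * calls
    with_cache = base * write_multiplier + base * read_multiplier * (calls - 1)
    return without, with_cache
```

For a 10,000-token prefix sent 100 times at an illustrative $3 per million input tokens, `cached_prefix_cost(10_000, 100, 3.0)` gives about $3.00 without caching and about $0.33 with it.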
Batch API
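OpenAI and Anthropic both offer batch endpoints: 50% off in exchange for asynchronous processing.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Upload a JSONL file with one request per line ("requests.jsonl" is a placeholder name)
uploaded_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
# Submit batch
batch = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)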
Use for: nightly summarization, bulk evals, offline tasks.
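The batch input is a JSONL file with one request per line. A minimal sketch of building those lines for a nightly summarization job, assuming OpenAI's batch request shape (`custom_id`, `method`, `url`, `body`); the function name, the prompt wording, and the model argument are placeholders of mine.

```python
import json

def build_batch_lines(docs: dict[str, str], model: str) -> list[str]:
    """Build one JSONL line per document for a /v1/chat/completions batch."""
    lines = []
    for doc_id, text in docs.items():
        request = {
            "custom_id": doc_id,  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Summarize:\n{text}"}],
            },
        }
        lines.append(json.dumps(request))
    return lines
```

Write the result with `"\n".join(lines)` to a file and upload that file before creating the batch.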
Limit output length
max_tokens=300 # cap output
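Output tokens are the most expensive part of a request. A tight max_tokens cap prevents runaway answers, but it truncates rather than shortens, so also ask for brevity in the prompt.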
Streaming = no cost change
Streaming lowers perceived latency but doesn't reduce cost: the tokens are counted and billed the same way.
Truncate history
Summarize old turns; keep recent verbatim.
if total_tokens(messages) > 100_000:
    older, recent = messages[:-10], messages[-10:]
    summary = llm(f"Summarize:\n{older}")
    messages = [{"role": "system", "content": f"Earlier: {summary}"}] + recent
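The same idea as a self-contained function, so it can be tested without a real model. This is a sketch: `count_tokens` and `summarize` are injected callables (hypothetical stand-ins for your tokenizer and a cheap summarization call), and the defaults mirror the numbers above.

```python
from typing import Callable

Message = dict[str, str]

def compact_history(
    messages: list[Message],
    count_tokens: Callable[[list[Message]], int],
    summarize: Callable[[list[Message]], str],
    limit: int = 100_000,
    keep_recent: int = 10,
) -> list[Message]:
    """Replace older turns with one summary message once the limit is exceeded."""
    if count_tokens(messages) <= limit or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)
    return [{"role": "system", "content": f"Earlier: {summary}"}] + recent
```

Using a small, cheap model for `summarize` keeps the compaction step itself from becoming a cost center.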
Smaller embeddings
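text-embedding-3-large with dimensions=512 is cheaper to store and search than the full 3072 dimensions (6x fewer values per vector). The API charge is per input token regardless of dimensions, so the saving shows up in vector storage and query cost.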
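A quick back-of-the-envelope check of how much vector size matters. This sketch assumes float32 storage (4 bytes per value) and counts raw vectors only, ignoring whatever index overhead your vector database adds.

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone (float32 by default), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9
```

For one million vectors, `index_size_gb(1_000_000, 3072)` is about 12.3 GB, while `index_size_gb(1_000_000, 512)` is about 2.0 GB.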
Local models for hot paths
Self-host Llama 4 / Qwen for high-volume, cheap operations. Keep the cloud LLM for the hard cases.
Cache LLM responses
Semantic cache: if a sufficiently similar query was asked before, return the cached answer.
def cached_llm(question):
    vec = embed(question)  # embed once, reuse for search and insert
    similar = vector_db.search(vec, k=1, threshold=0.95)
    if similar:
        return similar[0].payload["answer"]
    answer = llm(question)
    vector_db.insert({"q": question, "answer": answer, "vec": vec})
    return answer
Works for FAQ-style apps; risky for personalized answers, where one user's cached response could be served to another.
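For experimenting without a vector database, the lookup can be sketched in memory with cosine similarity. Everything here is an assumption for illustration: the `SemanticCache` class, the linear scan, and the injected `embed` function are mine, not any library's API.

```python
import math
from typing import Callable, Optional

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Linear-scan cache; fine for a sketch, too slow for large caches."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, question: str) -> Optional[str]:
        """Return the answer of the most similar entry if it clears the threshold."""
        vec = self.embed(question)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the cache never hits.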
Rate limit / token budget per user
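from collections import defaultdict

user_tokens = defaultdict(int)  # tokens used today, per user
daily_limit = 100_000  # example value; tune per plan

def chat(user_id, message):
    if user_tokens[user_id] >= daily_limit:
        return "Limit reached"
    response = llm(message)  # llm: your provider call
    # OpenAI-style field; Anthropic reports input_tokens and output_tokens separately
    user_tokens[user_id] += response.usage.total_tokens
    return response
Prevents abuse and caps the worst-case spend per user.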
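A per-user budget also needs to reset each day, which the snippet above leaves out. A minimal in-memory sketch, assuming a single process; the `DailyTokenBudget` class is hypothetical, and a real deployment would more likely keep the counters in a shared store.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

class DailyTokenBudget:
    """Tracks tokens per user and clears all counters when the date changes."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.day = date.today()
        self.used: dict[str, int] = defaultdict(int)

    def _roll_over(self, today: date) -> None:
        if today != self.day:
            self.day = today
            self.used.clear()

    def allow(self, user_id: str, today: Optional[date] = None) -> bool:
        """True while the user is still under today's limit."""
        self._roll_over(today or date.today())
        return self.used[user_id] < self.daily_limit

    def record(self, user_id: str, tokens: int, today: Optional[date] = None) -> None:
        """Add the tokens consumed by a finished request."""
        self._roll_over(today or date.today())
        self.used[user_id] += tokens
```

Call `allow` before the model call and `record` after it with the usage the provider reports; the `today` parameter exists only to make the reset testable.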
Reasoning models cost
o-series / extended thinking: you pay for the “thinking” tokens too, billed as output tokens even though they never appear in the answer. Use selectively.
Compare providers regularly
Prices drop and new models ship often. Re-benchmark cost and quality at least yearly.
Token efficiency
- Concise system prompts.
- Avoid repetition in few-shot.
- Use IDs instead of long strings where possible.
- JSON shorter than prose for structured data.
Monitor
Log per-request:
- Tokens (in, out, total).
- Model used.
- Cached (yes/no).
- Cost.
Dashboard: spend by user, by feature, over time.
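The per-request log above can be sketched as one JSON line per call, which most dashboards can aggregate by user, feature, and time. The price table is illustrative only (not real rates), the model name is a placeholder, and the cost formula ignores cache discounts for simplicity.

```python
import json
import time
from dataclasses import dataclass, asdict

# Illustrative prices in USD per million tokens; NOT real rates.
PRICES = {"small-model": {"in": 0.50, "out": 2.00}}

@dataclass
class RequestLog:
    user_id: str
    feature: str
    model: str
    tokens_in: int
    tokens_out: int
    cached: bool
    cost_usd: float
    ts: float

def log_request(user_id, feature, model, tokens_in, tokens_out, cached, prices=PRICES):
    """Compute the request cost and emit one JSON log line."""
    p = prices[model]
    # Simplification: cached input is priced like normal input here.
    cost = (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000
    record = RequestLog(user_id, feature, model, tokens_in, tokens_out,
                        cached, round(cost, 6), time.time())
    print(json.dumps(asdict(record)))  # one JSON line per request
    return record
```

Total tokens is left out of the record because it is just `tokens_in + tokens_out`; keeping the two separate is what lets you see which side drives the bill.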
Common waste
- Sending full chat history every turn.
- Including all RAG chunks (top 50 when 5 suffice).
- Using premium model for simple tasks.
- Re-generating same answer for FAQ.
- Output verbosity (“Sure! Here’s a step by step…”).
Read this next
If you want my LLM cost dashboard, it’s at rajpoot.dev.