LLM cost optimization.

Where money goes

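  • Input tokens (the cheaper side, per token).
  • Output tokens (typically 3-5x the per-token price of input).
  • Context window (the whole prompt is billed as input on every call, so long inputs add up).
  • Embeddings (usually cheap).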
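To make the split concrete, here is a minimal sketch of the per-request arithmetic. The prices are placeholders picked for easy math, not any provider's real rates, and the `Pricing` and `request_cost` names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. Placeholder values, not real list prices."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a single request."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Hypothetical model: $1 per million input tokens, $4 per million output tokens.
example = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)

# 8,000 tokens of context in, 500 tokens out.
cost = request_cost(example, input_tokens=8_000, output_tokens=500)
input_share = (8_000 * example.input_per_mtok / 1_000_000) / cost
```

With these made-up numbers the input side is most of the bill even though each output token costs 4x more, which is the "long inputs add up" point.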

Right-sizing the model

Task                   Model
Quick classification   Haiku, gpt-4o-mini
RAG synthesis          Sonnet, gpt-4o
Complex reasoning      Opus, o-series
Embeddings             small/medium dim

Default to the smallest model that does the job; escalate only when quality falls short.
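One way to wire up "escalate" is a tiered fallback. This is only a sketch: the models are plain callables standing in for real API calls, and `is_good_enough` is a hypothetical quality check you would have to define for your task (a confidence field, a schema validation, a regex).

```python
from typing import Callable, Sequence

# A "model" here is any callable that takes a prompt and returns text.
Model = Callable[[str], str]


def answer_with_escalation(
    prompt: str,
    tiers: Sequence[Model],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, int]:
    """Try models from cheapest to most expensive; return (answer, tier_index)."""
    if not tiers:
        raise ValueError("at least one model tier is required")
    answer = ""
    for index, model in enumerate(tiers):
        answer = model(prompt)
        if is_good_enough(answer):
            return answer, index
    # Nothing passed the check: fall back to the strongest model's answer.
    return answer, len(tiers) - 1


# Stand-ins for real API calls, for illustration only.
def small_model(prompt: str) -> str:
    return "UNSURE"


def large_model(prompt: str) -> str:
    return "positive"


answer, tier = answer_with_escalation(
    "Classify the sentiment: 'great product'",
    tiers=[small_model, large_model],
    is_good_enough=lambda text: text != "UNSURE",
)
```

Escalation only pays off if most requests stop at the first tier, since an escalated request pays for both calls.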

Prompt caching (Anthropic)

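import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    system=[
        # Everything up to and including this block is cached as a prefix.
        {"type": "text", "text": LONG_INSTRUCTIONS, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": q}],
)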

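The cache TTL is 5 minutes by default and is refreshed on each hit. Cache hits are billed at ~10% of the normal input price; the call that writes the cache costs somewhat more than an uncached call. Huge savings when the same long context is reused within the TTL.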
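A rough sketch of the savings math for a shared prefix. The ~10% read price comes from the note above; the 25% write surcharge is my assumption about how cache writes are priced, and the token price is a placeholder. Check current pricing before relying on either.

```python
def cached_prefix_cost(
    prefix_tokens: int,
    calls: int,
    base_price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumption: cache writes cost 25% extra
    read_multiplier: float = 0.10,   # ~10% of input price on cache hits
) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) in USD for the shared prefix.

    Assumes every call lands inside the cache TTL, so only the first call
    pays the write price and all later calls are cache hits.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    per_call = prefix_tokens * base_price_per_mtok / 1_000_000
    without_cache = per_call * calls
    with_cache = per_call * (write_multiplier + read_multiplier * (calls - 1))
    return without_cache, with_cache


# Placeholder numbers: 10k-token system prompt, 100 calls, $5 per million input tokens.
uncached, cached = cached_prefix_cost(
    prefix_tokens=10_000, calls=100, base_price_per_mtok=5.0
)
```

Under these assumptions a single call is slightly more expensive with caching, and the cache pays for itself from the second call onward.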

OpenAI also offers cached input pricing. It is applied automatically (gpt-5 etc.), with no request changes needed.

Batch API

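OpenAI and Anthropic both offer batch endpoints: 50% off in exchange for asynchronous processing, with results returned within the completion window (24h) rather than immediately.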

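# Submit batch (OpenAI SDK)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# requests.jsonl is a hypothetical file name: one request per line
uploaded_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)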

Use for: nightly summarization, bulk evals, offline tasks.
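The input file for an OpenAI batch is JSONL, one request per line. Below is a sketch of building those lines for a nightly summarization job. The document IDs, the system prompt, and the `build_batch_lines` helper are made up for illustration; the `custom_id` / `method` / `url` / `body` layout is the batch input format as I understand it, so verify against the current docs.

```python
import json


def build_batch_lines(
    documents: dict[str, str], model: str, max_tokens: int = 200
) -> list[str]:
    """Return one JSON string per document, ready to be written as JSONL."""
    lines = []
    for doc_id, text in documents.items():
        request = {
            "custom_id": doc_id,  # lets you match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "system", "content": "Summarize the document in two sentences."},
                    {"role": "user", "content": text},
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines


docs = {"doc-1": "First report text.", "doc-2": "Second report text."}
batch_lines = build_batch_lines(docs, model="gpt-4o-mini")
jsonl_payload = "\n".join(batch_lines) + "\n"  # write this to the upload file
```

Results come back in arbitrary order, which is why each request carries its own `custom_id`.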

Limit output length

max_tokens=300                       # cap output

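Output tokens are the most expensive part of a request. A tight max_tokens prevents runaway answers, but it cuts the answer off rather than making it shorter, so ask for brevity in the prompt as well.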

Streaming = no cost change

Streaming feels faster because tokens arrive as they are generated, but it does not reduce cost. The same tokens are billed either way.

Truncate history

Summarize old turns; keep recent verbatim.

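# total_tokens() and llm() are your own helpers for counting tokens and calling a model.
if total_tokens(messages) > 100_000:
    older, recent = messages[:-10], messages[-10:]
    # Render old turns as plain text instead of dumping the Python list repr.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = llm(f"Summarize this conversation:\n{transcript}")
    messages = [{"role": "system", "content": f"Earlier: {summary}"}] + recent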
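A self-contained version of the same idea, as a sketch. The 4-characters-per-token estimate is a rough heuristic for English text, not a real tokenizer, and `summarize` is whatever function you use to call a cheap model. Here it is stubbed out so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]


def estimate_tokens(messages: list[Message]) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return sum(len(m["content"]) for m in messages) // 4


def compact_history(
    messages: list[Message],
    summarize: Callable[[str], str],
    max_tokens: int = 100_000,
    keep_recent: int = 10,
) -> list[Message]:
    """Replace older turns with a summary once the history exceeds max_tokens."""
    if estimate_tokens(messages) <= max_tokens or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = summarize(transcript)
    return [{"role": "system", "content": f"Earlier: {summary}"}] + recent


# Stub summarizer so the example runs without an API call.
def fake_summarize(transcript: str) -> str:
    return f"{len(transcript.splitlines())} earlier turns omitted"


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 400}
    for i in range(30)
]
compacted = compact_history(history, fake_summarize, max_tokens=1_000, keep_recent=4)
```

The summary call itself costs tokens, so it is worth running it on a small model and only when the threshold is crossed.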

Smaller embeddings

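text-embedding-3-large with dimensions=512 is cheaper to store and search than the full 3072. The embedding call itself is billed per input token, so the saving is in your vector database, not on the API bill.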
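One place the difference shows up is raw index size. A back-of-the-envelope sketch, assuming float32 vectors and ignoring index overhead and metadata (the 10 million vector count is just an example):

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw size of a vector index in gigabytes (10**9 bytes), float32 by default."""
    return num_vectors * dimensions * bytes_per_value / 1e9


full = index_size_gb(10_000_000, 3072)
small = index_size_gb(10_000_000, 512)
ratio = full / small
```

Shorter vectors trade some retrieval quality for that size reduction, so measure recall on your own data before committing.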

Local models for hot paths

Self-host Llama 4 / Qwen for high-volume, low-difficulty operations. Keep the cloud LLM for the hard cases.
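Whether self-hosting is cheaper depends on volume. A very rough break-even sketch: the GPU and API prices are placeholders, and it ignores engineering time, idle capacity, and quality differences, which often dominate.

```python
def breakeven_tokens_per_hour(gpu_cost_per_hour: float, api_price_per_mtok: float) -> float:
    """Tokens per hour at which a rented GPU costs the same as paying the API."""
    if api_price_per_mtok <= 0:
        raise ValueError("api_price_per_mtok must be positive")
    return gpu_cost_per_hour / api_price_per_mtok * 1_000_000


# Placeholder numbers: $2/hour GPU vs. $0.50 per million tokens on an API.
threshold = breakeven_tokens_per_hour(gpu_cost_per_hour=2.0, api_price_per_mtok=0.5)
```

Below that sustained throughput the API is cheaper under these assumptions; above it, self-hosting starts to win, provided the box can actually serve that many tokens per hour.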

Cache LLM responses

Semantic cache: if a similar query has been asked before, return the cached answer.

def cached_llm(question):
    vec = embed(question)  # embed once; reuse for lookup and insert
    similar = vector_db.search(vec, k=1, threshold=0.95)
    if similar:
        return similar[0].payload["answer"]

    answer = llm(question)
    vector_db.insert({"q": question, "answer": answer, "vec": vec})
    return answer

Works for FAQ-style apps; risky for personalized answers, where a near-match can return a response written for someone else.
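For experimenting without a vector database, here is an in-memory sketch of the same pattern. The letter-count `toy_embed` is a deliberately crude stand-in for a real embedding model, and the linear scan only makes sense for small caches.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self._embed = embed
        self._threshold = threshold
        self._entries: list[tuple[Vector, str]] = []

    def get(self, question: str) -> Optional[str]:
        """Return the answer of the most similar cached question, or None."""
        query = self._embed(question)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None

    def put(self, question: str, answer: str) -> None:
        self._entries.append((self._embed(question), answer))


def toy_embed(text: str) -> Vector:
    """Toy embedding: counts of each letter a-z. Not suitable for real use."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


cache = SemanticCache(toy_embed)
cache.put("How do I reset my password?", "Use the reset link on the login page.")
hit = cache.get("how do i reset my password")
miss = cache.get("What is your refund policy?")
```

The threshold is the knob that matters: too low and users get answers to questions they did not ask, too high and the cache never hits.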

Rate limit / token budget per user

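# user_tokens: e.g. a defaultdict(int) keyed by user ID, reset daily
def chat(user_id, message):
    # >= so a user sitting exactly at the limit is blocked too
    if user_tokens[user_id] >= daily_limit:
        return "Limit reached"
    response = llm(...)
    # OpenAI-style usage field; Anthropic reports input_tokens and output_tokens separately
    user_tokens[user_id] += response.usage.total_tokens
    return response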

Prevents abuse and caps the worst-case spend per user.
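The snippet above never resets the counters. A minimal in-memory sketch with a daily rollover follows; the class name and methods are made up for this example, and a real deployment would likely keep the counters in Redis or a database so they survive restarts and work across processes.

```python
from collections import defaultdict
from datetime import date
from typing import Callable


class DailyTokenBudget:
    """Per-user token counter that resets when the calendar day changes."""

    def __init__(self, daily_limit: int, today: Callable[[], date] = date.today):
        self._limit = daily_limit
        self._today = today  # injectable clock, handy for tests
        self._day = today()
        self._used: defaultdict[str, int] = defaultdict(int)

    def _roll_over(self) -> None:
        current = self._today()
        if current != self._day:
            self._day = current
            self._used.clear()

    def allow(self, user_id: str) -> bool:
        """True if the user still has budget left today."""
        self._roll_over()
        return self._used[user_id] < self._limit

    def record(self, user_id: str, tokens: int) -> None:
        """Add the tokens consumed by a finished request."""
        self._roll_over()
        self._used[user_id] += tokens

    def remaining(self, user_id: str) -> int:
        self._roll_over()
        return max(0, self._limit - self._used[user_id])


budget = DailyTokenBudget(daily_limit=1_000)
budget.record("alice", 900)
alice_ok_before = budget.allow("alice")   # still under the limit
budget.record("alice", 200)
alice_ok_after = budget.allow("alice")    # now over the limit
```

Because usage is recorded after the call, a user can overshoot the limit by one request. Pre-reserving an estimate before the call closes that gap if it matters.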

Reasoning models cost

o-series / extended thinking: you pay for the “thinking” tokens too, billed as output tokens even when you never see them. Use selectively.

Compare providers regularly

Prices drop. Re-bench yearly.

Token efficiency

  • Concise system prompts.
  • Avoid repetition in few-shot.
  • Use IDs instead of long strings where possible.
  • JSON shorter than prose for structured data.
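On the JSON point: how you serialize matters too. A small sketch comparing pretty-printed and compact output for the same made-up record. Character count is only a proxy for tokens, so measure with your provider's tokenizer if the difference matters.

```python
import json

# Made-up record for illustration.
record = {
    "order_id": 1042,
    "status": "shipped",
    "items": [{"sku": "A-17", "qty": 2}, {"sku": "B-03", "qty": 1}],
}

pretty = json.dumps(record, indent=2)
compact = json.dumps(record, separators=(",", ":"))  # no spaces or newlines

saved_chars = len(pretty) - len(compact)
```

Both strings parse back to the same object. The indentation and spacing are only there for human readers, and the model does not need them.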

Monitor

Log per-request:

  • Tokens (in, out, total).
  • Model used.
  • Cached (yes/no).
  • Cost.

Dashboard: spend by user, by feature, over time.
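A sketch of what one log record and the dashboard roll-up could look like. The field names, model names, and costs are illustrative; in practice you would write these rows to your logging pipeline or a database and add a timestamp for the over-time view.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLog:
    user_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cached: bool
    cost_usd: float

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def spend_by(logs: list[RequestLog], key: str) -> dict[str, float]:
    """Sum cost_usd grouped by one attribute, e.g. 'user_id' or 'feature'."""
    totals: defaultdict[str, float] = defaultdict(float)
    for entry in logs:
        totals[getattr(entry, key)] += entry.cost_usd
    return dict(totals)


# Made-up rows for illustration.
logs = [
    RequestLog("alice", "search", "small-model", 1_200, 150, True, 0.002),
    RequestLog("alice", "chat", "large-model", 4_000, 600, False, 0.030),
    RequestLog("bob", "chat", "large-model", 3_000, 400, False, 0.020),
]

by_user = spend_by(logs, "user_id")
by_feature = spend_by(logs, "feature")
```

Grouping by feature is usually the more actionable view: it tells you which prompt to shorten or which call to move to a smaller model.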

Common waste

  • Sending full chat history every turn.
  • Including all RAG chunks (top 50 when 5 suffice).
  • Using premium model for simple tasks.
  • Re-generating same answer for FAQ.
  • Output verbosity (“Sure! Here’s a step by step…”).
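For the RAG item, a sketch of trimming retrieved chunks before they reach the prompt. The `Chunk` type, the score cutoff, and the 4-characters-per-token estimate are all assumptions for illustration. Plug in your retriever's scores and a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # retriever relevance score, higher is better


def select_chunks(
    chunks: list[Chunk],
    max_chunks: int = 5,
    min_score: float = 0.0,
    token_budget: int = 2_000,
) -> list[Chunk]:
    """Keep the best-scoring chunks, bounded by count, score, and a token budget."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(selected) >= max_chunks or chunk.score < min_score:
            break  # sorted by score, so nothing later can qualify
        cost = max(1, len(chunk.text) // 4)  # rough token estimate
        if used + cost > token_budget:
            continue  # too big for the remaining budget; try a smaller one
        selected.append(chunk)
        used += cost
    return selected


# 50 made-up candidates with steadily decreasing scores.
candidates = [Chunk(text=f"passage {i} " * 20, score=1 - i / 50) for i in range(50)]
top = select_chunks(candidates, max_chunks=5, min_score=0.5)
```

Fifty candidates in, five out: the prompt carries a tenth of the retrieved text, and the model has less irrelevant context to wade through.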

Read this next

If you want my LLM cost dashboard, it’s at rajpoot.dev.

