AI/LLM Cheatsheet 11 — Cost Optimization

LLM cost optimization.

Where money goes

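Input tokens (cheap-ish per token, but billed on every request).
Output tokens (typically priced 3-5x higher per token than input).
Context window (long inputs add up; chat history is re-sent and re-billed every turn).
Embeddings (usually cheap).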
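A rough per-request estimate follows directly from the two token counts. This is a minimal sketch: the `Pricing` class and the dollar figures are placeholders for illustration, not any provider's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per million tokens (placeholder values, not real quotes)."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD."""
    total = input_tokens * pricing.input_per_mtok + output_tokens * pricing.output_per_mtok
    return total / 1_000_000


# Hypothetical prices: $3 per MTok in, $15 per MTok out (a 5x output multiplier).
example = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
cost = request_cost(example, input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")  # $0.0135
```

Note that the 500 output tokens cost more than the 2,000 input tokens here, which is why output length is worth watching.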

Right-sizing the model

Task	Model
Quick classification	Haiku, gpt-4o-mini
RAG synthesis	Sonnet, gpt-4o
Complex reasoning	Opus, o-series
Embeddings	small/medium dim

Default to the smallest model that passes your evals; escalate only when it falls short.
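One way to sketch the escalation idea is a cascade that tries models cheapest-first. Everything here is hypothetical: `small_model` and `large_model` are stand-ins for real API calls, and "returns None when unsure" is an assumed convention (in practice the signal might be a confidence score or a failed validation).

```python
from typing import Callable, Optional

# A "model" is any callable that returns an answer, or None when it is unsure.
Model = Callable[[str], Optional[str]]


def cascade(question: str, models: list[Model]) -> tuple[int, str]:
    """Try models cheapest-first; return (tier index, answer) from the first that answers."""
    for tier, model in enumerate(models):
        answer = model(question)
        if answer is not None:
            return tier, answer
    raise RuntimeError("no model produced an answer")


# Stand-ins for real API calls (hypothetical behaviour).
def small_model(q: str) -> Optional[str]:
    # Pretends to give up on long inputs.
    return "positive" if len(q) < 40 else None


def large_model(q: str) -> Optional[str]:
    return "needs detailed analysis"


tier, answer = cascade("Great product!", [small_model, large_model])
print(tier, answer)  # 0 positive
```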

Prompt caching (Anthropic)

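import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    system=[
        # Mark the long, stable prefix as cacheable.
        {"type": "text", "text": LONG_INSTRUCTIONS, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": q}],
)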

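5-minute cache TTL by default, refreshed on each hit. Cache reads cost ~10% of the normal input price; cache writes cost about 25% more than normal input, so caching pays off from the second request on. Huge savings for repeated context.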

OpenAI also offers discounted cached input, applied automatically to repeated prompt prefixes (gpt-5 etc.), with no code changes.
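To see how quickly caching pays off, the arithmetic can be sketched as below. The multipliers (1.25x for the first, cache-writing request and 0.10x for later cache reads) and the $3 per MTok price are assumptions for illustration; check the provider's current pricing before relying on them. The sketch also assumes every request arrives before the cache entry expires.

```python
def uncached_prefix_cost(n_requests: int, prefix_tokens: int, price_per_mtok: float) -> float:
    """Cost in USD of re-sending the same prefix on every request."""
    return n_requests * prefix_tokens * price_per_mtok / 1_000_000


def cached_prefix_cost(
    n_requests: int,
    prefix_tokens: int,
    price_per_mtok: float,
    write_mult: float = 1.25,  # assumed premium for the request that writes the cache
    read_mult: float = 0.10,   # assumed discount for requests that hit the cache
) -> float:
    """Cost in USD of the prefix when the first request writes and the rest read."""
    if n_requests == 0:
        return 0.0
    units = write_mult + (n_requests - 1) * read_mult
    return units * prefix_tokens * price_per_mtok / 1_000_000


plain = uncached_prefix_cost(100, 10_000, 3.0)
cached = cached_prefix_cost(100, 10_000, 3.0)
print(f"${plain:.2f} vs ${cached:.2f}")  # $3.00 vs $0.33
```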

Batch API

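OpenAI and Anthropic both offer batch endpoints: about 50% off in exchange for asynchronous processing, with results returned within 24 hours rather than immediately.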

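from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL request file (the file name is an example), then submit the batch
uploaded_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)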

Use for: nightly summarization, bulk evals, offline tasks.
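The batch input is a JSONL file with one request per line. A minimal sketch of building those lines for the OpenAI chat completions endpoint follows; `build_batch_lines`, the `doc-1`/`doc-2` IDs, and the prompts are made up for illustration.

```python
import json


def build_batch_lines(prompts: dict[str, str], model: str, max_tokens: int = 300) -> list[str]:
    """Return one JSON string per request, keyed by a caller-chosen custom_id."""
    lines = []
    for custom_id, prompt in prompts.items():
        request = {
            "custom_id": custom_id,  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return lines


lines = build_batch_lines(
    {"doc-1": "Summarize: first document", "doc-2": "Summarize: second document"},
    model="gpt-4o-mini",
)
print(len(lines))  # 2
```

Writing `"\n".join(lines)` to a file gives the input that gets uploaded before the batch is submitted.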

Limit output length

max_tokens=300                       # cap output

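Output tokens are the most expensive part per token. A tight max_tokens stops runaway answers, but it truncates rather than shortens, so also ask for brevity in the prompt.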

Streaming = no cost change

Streaming feels faster but doesn’t reduce cost: the same tokens are billed either way.

Truncate history

Summarize old turns; keep recent verbatim.

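# total_tokens() and llm() are app-level helpers (token counter, model call)
if total_tokens(messages) > 100_000:
    older, recent = messages[:-10], messages[-10:]  # keep the last 10 messages verbatim
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = llm(f"Summarize:\n{transcript}")
    messages = [{"role": "system", "content": f"Earlier: {summary}"}] + recent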
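A self-contained version of the same idea might look like the sketch below. The names `rough_tokens` and `compact_history` are hypothetical, the 4-characters-per-token estimate is a crude heuristic (a real tokenizer is more accurate), and `summarize` is whatever function calls your model.

```python
from typing import Callable

Message = dict[str, str]


def rough_tokens(messages: list[Message]) -> int:
    """Crude estimate: about 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def compact_history(
    messages: list[Message],
    summarize: Callable[[str], str],
    budget: int = 100_000,
    keep_recent: int = 10,
) -> list[Message]:
    """Replace older turns with a summary once the history exceeds the budget."""
    if rough_tokens(messages) <= budget or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = summarize(transcript)
    return [{"role": "system", "content": f"Earlier: {summary}"}] + recent


# Usage with a fake summarizer standing in for a model call.
history = [{"role": "user", "content": "x" * 400} for _ in range(30)]
compacted = compact_history(history, lambda text: "summary", budget=1_000, keep_recent=4)
print(len(compacted))  # 5
```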

Smaller embeddings

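text-embedding-3-large with dimensions=512 instead of the full 3072: the embedding call is billed per input token either way, so the savings come from ~6x less vector storage and cheaper, faster similarity search, often for only a small loss in retrieval quality.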
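To put a number on the storage side, here is a quick back-of-the-envelope calculation. It assumes raw float32 vectors (4 bytes per value) and ignores index overhead and metadata, so real figures will be somewhat higher; `index_size_gb` is a made-up helper name.

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone, in GB (10^9 bytes)."""
    return n_vectors * dims * bytes_per_value / 1e9


full = index_size_gb(1_000_000, 3072)
small = index_size_gb(1_000_000, 512)
print(f"{full:.3f} GB vs {small:.3f} GB")  # 12.288 GB vs 2.048 GB
```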

Local models for hot paths

Self-host Llama 4 / Qwen for high-volume, cheap operations; keep cloud LLMs for the hard cases.

Cache LLM responses

Semantic cache: if a sufficiently similar query was asked before, return the cached answer.

def cached_llm(question):
    vec = embed(question)  # embed once; reused for lookup and insert
    # vector_db is a placeholder client; 0.95 is a similarity threshold to tune
    similar = vector_db.search(vec, k=1, threshold=0.95)
    if similar:
        return similar[0].payload["answer"]

    answer = llm(question)
    vector_db.insert({"q": question, "answer": answer, "vec": vec})
    return answer

Works for FAQ-style apps; risky for personalized or time-sensitive answers, where a cached reply can be wrong for the current user.
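For experimenting without a vector database, the same pattern can be sketched in memory. This is illustrative only: `toy_embed` is a bag-of-words stand-in for a real embedding model, `SemanticCache` and `cached_answer` are hypothetical names, and the linear scan would not scale to a large cache.

```python
import math
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, question: str) -> Optional[str]:
        """Return the cached answer for the most similar question, if similar enough."""
        vec = toy_embed(question)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((toy_embed(question), answer))


def cached_answer(cache: SemanticCache, question: str, llm: Callable[[str], str]) -> str:
    hit = cache.get(question)
    if hit is not None:
        return hit
    answer = llm(question)  # only pay for the model call on a miss
    cache.put(question, answer)
    return answer
```

The threshold is the main design choice: too low and users get answers to questions they did not ask, too high and the cache rarely hits.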

Rate limit / token budget per user

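def chat(user_id, message):
    used = user_tokens.get(user_id, 0)  # users with no usage yet start at 0
    if used >= daily_limit:
        return "Limit reached"
    response = llm(...)
    # OpenAI-style usage field; Anthropic reports input_tokens and output_tokens
    user_tokens[user_id] = used + response.usage.total_tokens
    return response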

Prevents abuse and caps the worst-case spend per user. Reset the counters daily.
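A daily limit needs counters that roll over. One minimal way to sketch that is to key usage by calendar day, as below. `DailyTokenBudget` is a hypothetical in-memory helper: a production version would likely use a shared store such as a database or cache, and would prune old days.

```python
from datetime import date
from typing import Optional


class DailyTokenBudget:
    """In-memory per-user token budget; usage is keyed by (user, calendar day)."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self._used: dict[tuple[str, date], int] = {}

    def remaining(self, user_id: str, today: Optional[date] = None) -> int:
        day = today or date.today()
        return max(0, self.daily_limit - self._used.get((user_id, day), 0))

    def allow(self, user_id: str, today: Optional[date] = None) -> bool:
        return self.remaining(user_id, today) > 0

    def record(self, user_id: str, tokens: int, today: Optional[date] = None) -> None:
        """Add the tokens a finished request actually used."""
        day = today or date.today()
        self._used[(user_id, day)] = self._used.get((user_id, day), 0) + tokens


budget = DailyTokenBudget(daily_limit=1_000)
budget.record("u1", 1_200, today=date(2025, 1, 1))
print(budget.allow("u1", today=date(2025, 1, 1)))  # False
print(budget.allow("u1", today=date(2025, 1, 2)))  # True
```

Because usage is only known after the response, a user can overshoot the limit by one request; reserving an estimate up front is a stricter alternative.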

Reasoning models cost

o-series / extended thinking: you pay for “thinking” tokens too, billed as output tokens even when they are not shown in the response. Use selectively.

Compare providers regularly

Prices drop. Re-bench yearly.

Token efficiency

Concise system prompts.
Avoid repetition in few-shot examples.
Use IDs instead of long strings where possible.
Compact JSON is usually shorter than prose for structured data.

Monitor

Log per-request:

Tokens (in, out, total).
Model used.
Cached (yes/no).
Cost.

Dashboard: spend by user, by feature, over time.
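The per-request record and the dashboard roll-ups can be sketched with a dataclass and a group-by. Field and function names here (`RequestLog`, `spend_by`) are illustrative, and the costs are made-up numbers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestLog:
    user_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cached: bool
    cost_usd: float


def spend_by(logs: list[RequestLog], key: str) -> dict[str, float]:
    """Total cost grouped by any string field, e.g. 'user_id', 'feature' or 'model'."""
    totals: dict[str, float] = defaultdict(float)
    for log in logs:
        totals[getattr(log, key)] += log.cost_usd
    return dict(totals)


logs = [
    RequestLog("alice", "search", "small", 100, 50, False, 0.5),
    RequestLog("bob", "chat", "large", 200, 80, True, 0.25),
    RequestLog("alice", "chat", "large", 300, 90, False, 1.0),
]
print(spend_by(logs, "user_id"))  # {'alice': 1.5, 'bob': 0.25}
print(spend_by(logs, "feature"))  # {'search': 0.5, 'chat': 1.25}
```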

Common waste

Sending full chat history every turn.
Including all RAG chunks (top 50 when 5 suffice).
Using a premium model for simple tasks.
Re-generating the same answer for FAQs.
Output verbosity (“Sure! Here’s a step by step…”).
