What's the single biggest perf win on the Anthropic API?

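Prompt caching. For RAG / long-context apps, marking your stable system prompt + retrieved context as cacheable cuts the input cost of the cached portion by roughly 90% on subsequent calls within the 5-minute cache window. The first call pays a small premium to write the cache, so the win starts as soon as the prefix is reused.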

Sonnet or Haiku in 2026?

Haiku for trivial classification / extraction. Sonnet for everything else. Opus for the hardest reasoning tasks where cost is justified. Most apps: Sonnet default + Haiku for routing/classifiers.

Anthropic API Best Practices in 2026 — Caching, Tool Use, Streaming, and Production Patterns

Anthropic’s API has features that compound. Used together, you get fast, cheap, reliable Claude integrations. Used separately, you leave money and quality on the table. This post is the working production playbook.

Prompt caching

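import anthropic

# Sync client; the snippets in this post that use `await` assume anthropic.AsyncAnthropic()
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API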
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)

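The system prompt is now marked cacheable. The first call writes the cache at a small premium; subsequent calls within 5min bill the cached tokens at ~10% of the normal input price.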
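To see where the savings come from, here is a small worked calculation. It is a sketch under assumed prices: cached reads billed at 10% of the base input price (the figure above), cache writes at a 25% premium, and an illustrative base price of $3 per million input tokens. The function name and all numbers are mine, so check the current pricing page before relying on them.

```python
def prefix_cost_usd(prefix_tokens, calls, base_usd_per_mtok,
                    write_mult=1.25, read_mult=0.10, cached=True):
    """Input cost of sending the same stable prefix `calls` times.

    write_mult and read_mult are ASSUMED price multipliers, not values
    fetched from the API. Assumes every call after the first hits the cache.
    """
    if calls <= 0:
        return 0.0
    per_token = base_usd_per_mtok / 1_000_000
    if not cached:
        return prefix_tokens * calls * per_token
    write = prefix_tokens * write_mult                # first call writes the cache
    reads = prefix_tokens * read_mult * (calls - 1)   # later calls read it
    return (write + reads) * per_token


uncached = prefix_cost_usd(10_000, 100, 3.0, cached=False)
cached = prefix_cost_usd(10_000, 100, 3.0)
savings = 1 - cached / uncached
print(f"uncached=${uncached:.4f} cached=${cached:.4f} savings={savings:.1%}")
```

With `calls=1` the cached variant costs more than the uncached one, which is why caching only pays off for prefixes that are actually reused inside the cache window.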

For RAG: cache the retrieved chunks too, with their own cache_control. This pays off when the same chunks are reused across calls, e.g. follow-up questions over the same documents.

system=[
    {"type": "text", "text": SYS_PROMPT, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": RETRIEVED_DOCS, "cache_control": {"type": "ephemeral"}},
]

Up to 4 cache breakpoints. Use them for stable layers.
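A minimal sketch of how the layering could be built in code. `build_system_blocks` is a hypothetical helper, not part of the SDK; it only assembles the list you would pass as `system=` and enforces the 4-breakpoint limit mentioned above.

```python
MAX_BREAKPOINTS = 4  # limit stated above


def build_system_blocks(layers):
    """Turn (text, cacheable) pairs into system content blocks.

    Order layers from most stable to least stable: caching works on the
    prompt prefix, so a change in an earlier block invalidates the cache
    for everything after it.
    """
    blocks = []
    used = 0
    for text, cacheable in layers:
        block = {"type": "text", "text": text}
        if cacheable:
            used += 1
            if used > MAX_BREAKPOINTS:
                raise ValueError(f"more than {MAX_BREAKPOINTS} cache breakpoints")
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    return blocks


system = build_system_blocks([
    ("You are a support assistant.", True),   # stable instructions
    ("<retrieved docs go here>", True),       # reused across follow-up questions
    ("Today's banner message.", False),       # changes often, so not cached
])
```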

See LLM Cost Optimization.

Streaming

async def stream_reply(prompt):
    # Async generator: yields text chunks as the model produces them
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text

Stream text as it is generated. UX win. See FastAPI Streaming.
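On the serving side, the chunks usually need some framing before they reach the browser. A rough sketch, assuming you forward them as Server-Sent Events; `sse_frames` and `fake_stream` are illustrative names, and `fake_stream` stands in for the SDK's text stream so the snippet runs without an API key.

```python
import asyncio
import json


async def sse_frames(text_stream):
    """Wrap text chunks from an async iterator in Server-Sent Events frames."""
    async for chunk in text_stream:
        # JSON-encode so newlines inside a chunk cannot break the SSE framing
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    yield "event: done\ndata: {}\n\n"


async def fake_stream():
    """Hypothetical stand-in for stream.text_stream, for local testing."""
    for piece in ["Hel", "lo\n", "world"]:
        yield piece


async def collect():
    return [frame async for frame in sse_frames(fake_stream())]


frames = asyncio.run(collect())
```

In a web framework you would hand the `sse_frames(...)` generator to a streaming response with the `text/event-stream` content type instead of collecting it into a list.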

Tool use loop

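import json

class MaxItersReached(Exception):
    """Raised when the tool loop hits its iteration cap."""

async def run(messages, tools, max_iters=10):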
    for _ in range(max_iters):
        resp = await client.messages.create(
            model="claude-sonnet-4-6",
            messages=messages,
            tools=tools,
            max_tokens=4096,
        )
        messages.append({"role": "assistant", "content": resp.content})
        
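        # Anything other than "tool_use" (end_turn, max_tokens, ...) ends the loop;
        # continuing would send an empty tool_results message
        if resp.stop_reason != "tool_use":
            return resp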
        
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use":
                try:
                    result = await dispatch(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result),
                    })
                except Exception as e:
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(e),
                        "is_error": True,
                    })
        
        messages.append({"role": "user", "content": tool_results})
    
    raise MaxItersReached()

See LLM Tool Use Patterns.
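The loop above calls a `dispatch` function that is not shown. One way to write it, as a sketch: a plain registry mapping tool names to async handlers. The registry, the decorator, and the `search_docs` handler body are all hypothetical.

```python
import asyncio

TOOL_REGISTRY = {}


def tool(name):
    """Register an async function as the handler for a tool name."""
    def decorator(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


@tool("search_docs")
async def search_docs(query: str):
    # Hypothetical handler: a real one would query your search backend
    return {"query": query, "hits": []}


async def dispatch(name, tool_input):
    """Look up the handler and call it with the model-supplied arguments."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        # Raising lets the loop report the problem back as an is_error result
        raise KeyError(f"unknown tool: {name}")
    return await handler(**tool_input)


result = asyncio.run(dispatch("search_docs", {"query": "refund policy"}))
```

Because unknown tools and bad arguments raise, the `except` branch in the loop turns them into `is_error` tool results and the model gets a chance to correct itself.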

Structured output via tools

from pydantic import BaseModel

class ResponseSchema(BaseModel):
    summary: str
    sentiment: str
    confidence: float

resp = client.messages.create(
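    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    tools=[{
        "name": "respond",
        "description": "Return the analysis as structured data.",
        "input_schema": ResponseSchema.model_json_schema(),
    }],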
    tool_choice={"type": "tool", "name": "respond"},
    messages=[{"role": "user", "content": "Analyze: ..."}],
)

# Parse
for block in resp.content:
    if block.type == "tool_use":
        result = ResponseSchema(**block.input)

Forced tool calling = schema-shaped output, and parsing it through the Pydantic model catches the occasional malformed response. Cleaner than “respond as JSON” prayers. See Structured Output.
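The schema above accepts any string for `sentiment`. A tighter variant, sketched here with a hand-written schema and standard-library validation only: the label set, the 0 to 1 confidence range, and the `parse_respond_input` helper are my assumptions, not something the API requires.

```python
SENTIMENTS = ("positive", "neutral", "negative")  # assumed label set

RESPOND_TOOL = {
    "name": "respond",
    "description": "Return the analysis as structured data.",
    "input_schema": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "sentiment": {"type": "string", "enum": list(SENTIMENTS)},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["summary", "sentiment", "confidence"],
    },
}


def parse_respond_input(tool_input):
    """Re-check a tool_use input dict against the constraints above."""
    required = RESPOND_TOOL["input_schema"]["required"]
    missing = [key for key in required if key not in tool_input]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if tool_input["sentiment"] not in SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {tool_input['sentiment']!r}")
    confidence = float(tool_input["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {
        "summary": str(tool_input["summary"]),
        "sentiment": tool_input["sentiment"],
        "confidence": confidence,
    }


parsed = parse_respond_input(
    {"summary": "Customer is happy.", "sentiment": "positive", "confidence": 0.9}
)
```

The `enum` steers the model toward valid labels; the explicit check is there because I would not treat the schema alone as a guarantee.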

Extended thinking

For hard reasoning:

resp = client.messages.create(
    model="claude-sonnet-4-6",
    thinking={"type": "enabled", "budget_tokens": 5000},
    max_tokens=8000,
    messages=[{"role": "user", "content": hard_problem}],
)

# Thinking blocks are returned separately
for block in resp.content:
    if block.type == "thinking":
        print("(reasoning)", block.thinking)
    elif block.type == "text":
        print(block.text)

Better answers on hard problems. Costs more. Use selectively.
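In the example above the thinking budget (5000) sits below `max_tokens` (8000), which leaves 3000 tokens for the visible answer. A small sketch that keeps the two in sync; the helper name is hypothetical, and the 1024-token minimum budget is my understanding of the API's lower bound, so verify it for your model.

```python
MIN_THINKING_BUDGET = 1024  # ASSUMED lower bound; check the docs for your model


def thinking_request_params(budget_tokens, answer_tokens):
    """Build `thinking` and `max_tokens` so the budget leaves room for the answer."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return {
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        # max_tokens covers thinking plus the visible answer
        "max_tokens": budget_tokens + answer_tokens,
    }


params = thinking_request_params(budget_tokens=5000, answer_tokens=3000)
```

The result can be splatted into the call, e.g. `client.messages.create(model=..., messages=..., **params)`.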

Batch API

batch = await client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {"model": "claude-sonnet-4-6", "max_tokens": 1024,
                                              "messages": [{"role": "user", "content": item.text}]}}
        for i, item in enumerate(items)
    ]
)

50% off; up to 24h delivery. For evals, content gen, classification: free money.
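For larger jobs it helps to build the request list separately and split it into several batches. A sketch with hypothetical helpers; the chunk size here is tiny for illustration, and the real per-batch request limit is something to look up rather than take from this snippet.

```python
def build_batch_requests(texts, model, max_tokens=1024, prefix="req"):
    """One request dict per input text, each with a unique custom_id."""
    return [
        {
            "custom_id": f"{prefix}-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for i, text in enumerate(texts)
    ]


def chunked(requests, size):
    """Split a request list into batches of at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [requests[i:i + size] for i in range(0, len(requests), size)]


requests = build_batch_requests(
    ["first doc", "second doc", "third doc"], "claude-sonnet-4-6"
)
batches = chunked(requests, 2)
```

Keep the `custom_id` values around: they are how you match results back to inputs, since results should not be assumed to come back in request order.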

See LLM Batch Processing.

Vision

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract line items from this invoice"},
        ],
    }]
)
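The snippet above assumes `img_b64` already exists. A minimal sketch of producing the image block from raw bytes with the standard library; `image_block` is a hypothetical helper, and the media-type list reflects the formats I know the API accepts.

```python
import base64

ALLOWED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_block(data: bytes, media_type: str):
    """Build a base64 image content block from raw image bytes."""
    if media_type not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unsupported media type: {media_type}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            # The API expects a plain base64 string, not bytes
            "data": base64.standard_b64encode(data).decode("ascii"),
        },
    }


# In real use: data = open("invoice.png", "rb").read()
block = image_block(b"\x89PNG\r\n\x1a\n", "image/png")
```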

See Multimodal LLMs.

Retries and rate limits

import anthropic
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type((anthropic.RateLimitError, anthropic.APIConnectionError, anthropic.InternalServerError)),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
)
async def call(messages):
    return await client.messages.create(...)

Don’t retry other 4xx errors (your bug); do retry 429 (rate limit) and 5xx (server-side).
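The rule above, plus the backoff schedule, written out as plain functions. This is a sketch of the idea rather than what tenacity does internally; the function names are mine, and the base and cap mirror the `min=2, max=60` settings in the decorator.

```python
import random


def is_retryable(status_code):
    """429 and 5xx are transient; other 4xx responses point at the request itself."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_ceiling(attempt, base=2.0, cap=60.0):
    """Upper bound in seconds for the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def backoff_with_jitter(attempt, rng=random.random):
    """Full jitter: a random wait between 0 and the ceiling, so clients do not retry in lockstep."""
    return backoff_ceiling(attempt) * rng()


ceilings = [backoff_ceiling(n) for n in range(6)]
```

The jitter matters for the retry-storm case: without it, every worker that hit the same 429 retries at the same instant.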

Token counting

count = await client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=[...],
)
print(count.input_tokens)

Before calling, count to estimate cost / fit context window.
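What you do with the count is simple arithmetic. A sketch, assuming a 200,000-token context window and a $3 per million input-token price; both numbers and both helper names are placeholders to replace with your model's actual values.

```python
def fits_context(input_tokens, max_output_tokens, context_window=200_000):
    """True if the prompt plus the requested output budget fits the window."""
    return input_tokens + max_output_tokens <= context_window


def estimate_input_cost_usd(input_tokens, usd_per_mtok):
    """Input-side cost only; output cost depends on how much the model writes."""
    return input_tokens * usd_per_mtok / 1_000_000


ok = fits_context(input_tokens=150_000, max_output_tokens=4096)
too_big = fits_context(input_tokens=199_000, max_output_tokens=2048)
cost = estimate_input_cost_usd(150_000, usd_per_mtok=3.0)
```

When the check fails, the usual options are trimming retrieved context, summarizing history, or lowering `max_tokens`.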

Streaming with tools

async with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,  # required by the Messages API
    tools=[...],
    messages=messages,
) as stream:
    async for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "tool_use":
                # Tool call coming
                pass
        elif event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                yield event.delta.text

Stream tokens; track tool use as it builds.
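Tool inputs arrive as fragments of a JSON string, so "tracking tool use as it builds" means accumulating those fragments per content block and parsing once the block closes. A sketch using stand-in event objects so it runs offline; `collect_tool_calls` is a hypothetical helper, and the field names (`index`, `input_json_delta`, `partial_json`) follow the streaming event format as I understand it.

```python
import json
from types import SimpleNamespace as NS


def collect_tool_calls(events):
    """Assemble complete tool calls from streamed content-block events."""
    open_blocks = {}  # block index -> {"id", "name", "parts"}
    calls = []
    for event in events:
        if event.type == "content_block_start" and event.content_block.type == "tool_use":
            block = event.content_block
            open_blocks[event.index] = {"id": block.id, "name": block.name, "parts": []}
        elif (event.type == "content_block_delta"
              and event.delta.type == "input_json_delta"
              and event.index in open_blocks):
            open_blocks[event.index]["parts"].append(event.delta.partial_json)
        elif event.type == "content_block_stop" and event.index in open_blocks:
            block = open_blocks.pop(event.index)
            raw = "".join(block["parts"])
            # A tool with no arguments may stream no JSON at all
            calls.append({"id": block["id"], "name": block["name"],
                          "input": json.loads(raw) if raw else {}})
    return calls


# Stand-in events shaped like the streamed ones, for local testing
events = [
    NS(type="content_block_start", index=0, content_block=NS(type="text", text="")),
    NS(type="content_block_delta", index=0, delta=NS(type="text_delta", text="Checking.")),
    NS(type="content_block_stop", index=0),
    NS(type="content_block_start", index=1,
       content_block=NS(type="tool_use", id="toolu_01", name="search_docs")),
    NS(type="content_block_delta", index=1,
       delta=NS(type="input_json_delta", partial_json='{"query": "ref')),
    NS(type="content_block_delta", index=1,
       delta=NS(type="input_json_delta", partial_json='und policy"}')),
    NS(type="content_block_stop", index=1),
]
calls = collect_tool_calls(events)
```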

System prompt design

You are <role>. <core capabilities>. <output format>.

When responding:
- Be concise; 2-3 sentences max.
- Use plain language; no jargon.
- If uncertain, say so explicitly.

Tool use:
- Use search_docs for factual questions.
- Use create_ticket for support issues.

Concise. Specific. The model follows clear instructions; vague ones produce vague output.

See Prompt Engineering 2026.

Cost monitoring

async def log_call(resp):
    await metrics.record(
        feature=current_feature(),
        input_tokens=resp.usage.input_tokens,
        cache_read=resp.usage.cache_read_input_tokens,
        cache_create=resp.usage.cache_creation_input_tokens,
        output_tokens=resp.usage.output_tokens,
        cost_usd=compute_cost(resp.usage, resp.model),
    )

Tag every call. Aggregate by feature. Find your top spenders.
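`compute_cost` is left undefined above. A possible shape for it, as a sketch: the price table and cache multipliers are hypothetical placeholders that you would maintain yourself, and the stand-in usage object only mimics the fields logged above.

```python
from types import SimpleNamespace

# HYPOTHETICAL prices in USD per million tokens; keep these in sync with the pricing page
PRICES = {"claude-sonnet-4-6": {"input": 3.00, "output": 15.00}}
CACHE_WRITE_MULT = 1.25  # assumed premium for writing the cache
CACHE_READ_MULT = 0.10   # assumed discount for reading it


def compute_cost(usage, model):
    """Estimate the USD cost of one call from its usage block."""
    price = PRICES[model]  # raises KeyError for models missing from the table
    cache_read = getattr(usage, "cache_read_input_tokens", None) or 0
    cache_create = getattr(usage, "cache_creation_input_tokens", None) or 0
    # input_tokens counts only the uncached part of the prompt
    input_units = (usage.input_tokens
                   + cache_create * CACHE_WRITE_MULT
                   + cache_read * CACHE_READ_MULT)
    return (input_units * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000


usage = SimpleNamespace(input_tokens=1000, cache_read_input_tokens=10_000,
                        cache_creation_input_tokens=0, output_tokens=500)
cost = compute_cost(usage, "claude-sonnet-4-6")
```

If the model string returned on the response differs from the one you requested (for example a dated ID), normalize it before the lookup.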

See LLM Observability.

Common mistakes

1. No prompt caching

Stable system prompt re-sent at full price every call. Add cache_control.

2. Sonnet for trivial classification

Use Haiku. Several times cheaper per token; comparable on simple tasks.

3. No streaming for chat UIs

User waits 5s for full response. Stream.

4. Tool use without max iters

Bug → infinite loop → bill explodes. Always cap.

5. Ignoring rate limits

429 → crash; retry storm. tenacity + exponential backoff.

What I’d ship today

For new Claude integrations:

Prompt caching universally.
Sonnet 4.6 default; Haiku for trivial.
Streaming for chat / interactive.
Tool use with schemas + max iters.
Batch API for non-realtime.
Tenacity retries on transient errors.
Per-feature cost tracking.
Tracing via Langfuse / OTEL.

Read this next

If you want my Anthropic API production starter (caching, tools, streaming, retries), it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.
