Anthropic’s API has features that compound. Used together, you get fast, cheap, reliable Claude integrations. Used separately, you leave money and quality on the table. This post is the working production playbook.
Prompt caching
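import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required on every Messages API call
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Cache everything up to and including this block
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)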
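The system prompt is now marked cacheable. Calls that reuse the same prefix within the 5-minute cache lifetime pay roughly 10% of the normal input price for the cached tokens; the call that writes the cache pays a premium (about 25% over the base input price).
For RAG: if the same retrieved chunks are reused across calls (for example, follow-up questions over the same documents), cache them too with a separate cache_control. Massive savings.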
system=[
    {"type": "text", "text": SYS_PROMPT, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": RETRIEVED_DOCS, "cache_control": {"type": "ephemeral"}},
]
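Up to 4 cache breakpoints per request. Use them for stable layers, ordered from most stable to least, because a change in an earlier block invalidates the cache for everything after it.
See LLM Cost Optimization.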
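To see when caching pays off, here is a small arithmetic sketch. It assumes the multipliers described above (cache reads at about 10% of the base input price, cache writes at about 125%); both are passed as parameters so you can plug in current pricing. The function names and the example price are illustrative, not part of the SDK.

```python
def uncached_input_cost(prefix_tokens, calls, base_price_per_mtok):
    """USD cost of re-sending the same prefix at full price on every call."""
    return prefix_tokens * calls * base_price_per_mtok / 1_000_000


def cached_input_cost(prefix_tokens, calls, base_price_per_mtok,
                      write_multiplier=1.25, read_multiplier=0.10):
    """USD cost when the first call writes the cache and the rest read it.

    Assumes every call lands inside the cache lifetime; the multipliers
    are assumptions to check against current pricing.
    """
    if calls < 1:
        return 0.0
    per_token = base_price_per_mtok / 1_000_000
    write = prefix_tokens * per_token * write_multiplier
    reads = prefix_tokens * per_token * read_multiplier * (calls - 1)
    return write + reads


# Illustrative numbers only: a 10,000-token prefix, 100 calls, $3 per million tokens.
full_price = uncached_input_cost(10_000, 100, 3.0)
with_cache = cached_input_cost(10_000, 100, 3.0)
print(f"uncached: ${full_price:.4f}  cached: ${with_cache:.4f}")
```

Under these assumptions a single call is slightly more expensive with caching (it only pays the write premium), and caching comes out ahead from the second call onward.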
Streaming
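# Assumes client = anthropic.AsyncAnthropic(); the async examples need the async client.
# stream_reply is an illustrative name for the async generator wrapping the stream.
async def stream_reply(prompt):
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        # text_stream yields only the text deltas
        async for text in stream.text_stream:
            yield text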
Stream text as it is generated so users see output immediately instead of waiting for the full response. UX win. See FastAPI Streaming.
Tool use loop
import json


class MaxItersReached(Exception):
    """Raised when the loop hits max_iters without a final answer."""


async def run(messages, tools, max_iters=10):
    for _ in range(max_iters):
        resp = await client.messages.create(
            model="claude-sonnet-4-6",
            messages=messages,
            tools=tools,
            max_tokens=4096,
        )
        messages.append({"role": "assistant", "content": resp.content})

        # Any stop reason other than a tool request (end_turn, max_tokens, ...)
        # ends the loop; otherwise we would send an empty tool_result turn.
        if resp.stop_reason != "tool_use":
            return resp

        tool_results = []
        for block in resp.content:
            if block.type == "tool_use":
                try:
                    result = await dispatch(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result),
                    })
                except Exception as e:
                    # Report the failure to the model so it can recover
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(e),
                        "is_error": True,
                    })
        messages.append({"role": "user", "content": tool_results})

    raise MaxItersReached()
See LLM Tool Use Patterns.
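The loop above calls a `dispatch` helper that the post does not show. One possible shape, sketched here as an assumption rather than the author's implementation, is a registry that maps tool names to async handlers. The `TOOLS` dict, the `tool` decorator, and the placeholder `search_docs` handler are all hypothetical.

```python
import asyncio

TOOLS = {}  # hypothetical registry: tool name -> async handler


def tool(name):
    """Register an async function as the handler for a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("search_docs")
async def search_docs(query: str):
    # Placeholder handler; a real one would query your search backend.
    return {"query": query, "hits": []}


async def dispatch(name, tool_input):
    """Look up the handler and call it with the model-provided arguments."""
    handler = TOOLS.get(name)
    if handler is None:
        # Raised errors are reported back to the model as is_error results
        raise ValueError(f"unknown tool: {name}")
    return await handler(**tool_input)


if __name__ == "__main__":
    print(asyncio.run(dispatch("search_docs", {"query": "refund policy"})))
```

Raising on an unknown tool name fits the loop's error handling: the exception becomes a tool_result with is_error set, and the model gets a chance to correct itself.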
Structured output via tools
from pydantic import BaseModel


class ResponseSchema(BaseModel):
    summary: str
    sentiment: str
    confidence: float


resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{
        "name": "respond",
        "description": "Return the structured analysis.",
        "input_schema": ResponseSchema.model_json_schema(),
    }],
    # Force the model to call this tool instead of answering in free text
    tool_choice={"type": "tool", "name": "respond"},
    messages=[{"role": "user", "content": "Analyze: ..."}],
)

# Parse and validate the tool input against the schema
for block in resp.content:
    if block.type == "tool_use":
        result = ResponseSchema.model_validate(block.input)
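Forcing a tool call makes the model return arguments shaped by your schema. Cleaner than “respond as JSON” prayers, and the Pydantic validation catches the occasional mismatch. See Structured Output.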
Extended thinking
For hard reasoning:
resp = client.messages.create(
    model="claude-sonnet-4-6",
    # budget_tokens counts toward max_tokens, so keep it below that limit
    thinking={"type": "enabled", "budget_tokens": 5000},
    max_tokens=8000,
    messages=[{"role": "user", "content": hard_problem}],
)

# Thinking blocks are returned separately
for block in resp.content:
    if block.type == "thinking":
        print("(reasoning)", block.thinking)
    elif block.type == "text":
        print(block.text)
Better answers on hard problems. Costs more, because thinking tokens are billed as output tokens. Use selectively.
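A small guard can catch a bad budget before the request goes out. This sketch assumes two constraints: a minimum thinking budget of 1,024 tokens and a budget that must stay below max_tokens. Treat both as assumptions to check against the current API documentation; `thinking_params` is a hypothetical helper name.

```python
MIN_THINKING_BUDGET = 1024  # assumed minimum; verify against current docs


def thinking_params(max_tokens, budget_tokens):
    """Return request kwargs for extended thinking, or raise on a bad budget."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(
            f"budget_tokens must be at least {MIN_THINKING_BUDGET}, got {budget_tokens}"
        )
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


# Same values as the example above; unpack into client.messages.create(**params, ...)
params = thinking_params(max_tokens=8000, budget_tokens=5000)
print(params)
```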
Batch API
batch = await client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",  # used to match results back to inputs
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": item.text}],
            },
        }
        for i, item in enumerate(items)  # one request per item, no hard-coded count
    ]
)
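Batch requests cost 50% less than standard calls, and results arrive asynchronously within 24 hours. For evals, content generation, and classification, where latency doesn't matter: free money.
See LLM Batch Processing.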
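Large jobs usually need to be split across several batches. The sketch below builds the request dicts and chunks them; it assumes a per-batch cap of 100,000 requests (verify against the current limits, and note that it does not check the payload size limit). `build_requests` and `chunked` are hypothetical helper names.

```python
from itertools import islice

MAX_REQUESTS_PER_BATCH = 100_000  # assumed limit; verify against current docs


def build_requests(texts, model="claude-sonnet-4-6", max_tokens=1024):
    """Turn plain texts into batch request dicts with stable custom_ids."""
    return [
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for i, text in enumerate(texts)
    ]


def chunked(requests, size=MAX_REQUESTS_PER_BATCH):
    """Yield consecutive slices of at most `size` requests."""
    it = iter(requests)
    while chunk := list(islice(it, size)):
        yield chunk


requests = build_requests(["first doc", "second doc", "third doc"])
for chunk in chunked(requests, size=2):
    # Each chunk would be passed to client.messages.batches.create(requests=chunk)
    print([r["custom_id"] for r in chunk])
```

Because custom_ids are assigned before chunking, they stay unique across batches, which keeps result matching simple.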
Vision
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract line items from this invoice"},
        ],
    }]
)
See Multimodal LLMs.
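The example above assumes `img_b64` already exists. A minimal way to produce the whole image block from a file on disk might look like this; `image_block` is a hypothetical helper, and the media type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path

# Image formats accepted by the API: JPEG, PNG, GIF, and WebP
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_block(path):
    """Build a base64 image content block from a local file."""
    media_type, _ = mimetypes.guess_type(str(path))
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported image type for {path}: {media_type}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dict drops straight into the content list next to the text block.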
Retries and rate limits
import anthropic
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type((
        anthropic.RateLimitError,       # 429
        anthropic.APIConnectionError,   # network failure
        anthropic.InternalServerError,  # 5xx
    )),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
)
async def call(messages, **params):
    # params carries model, max_tokens, and any other request options
    return await client.messages.create(messages=messages, **params)
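Retry on 429 (rate limit) and 5xx (server errors); don’t retry other 4xx responses (your bug). The SDK also retries some failures on its own (max_retries defaults to 2), so account for that when layering tenacity on top.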
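The retry rule can also be written without dependencies, which makes the policy easy to unit test. This is a sketch of the same idea with full jitter added so that many clients do not retry in lockstep; it is not what tenacity runs internally, and both function names are hypothetical. tenacity's `wait_random_exponential` offers comparable jittered waits.

```python
import random


def is_retryable(status_code):
    """Retry rate limits (429) and server errors (5xx); other 4xx are caller bugs."""
    return status_code == 429 or status_code >= 500


def backoff_delay(attempt, base=2.0, cap=60.0, rng=random):
    """Full-jitter delay in seconds for a 0-based retry attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


for attempt in range(5):
    print(f"attempt {attempt}: sleep up to {min(60.0, 2.0 * 2 ** attempt):.0f}s, "
          f"picked {backoff_delay(attempt):.2f}s")
```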
Token counting
count = await client.messages.count_tokens(
    model="claude-sonnet-4-6",
    messages=[...],
)
print(count.input_tokens)
Count before calling to estimate cost and to check that the prompt fits the context window.
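One use of the count is trimming old turns until the conversation fits a budget. The sketch below takes the counter as a parameter so it stays testable; in production, `count_tokens` would be a hypothetical wrapper around `client.messages.count_tokens` that returns `input_tokens`. It drops the oldest messages and makes sure the history still starts with a user turn. It does not try to keep tool_use and tool_result pairs together, so treat it as a starting point.

```python
def trim_to_fit(messages, count_tokens, budget):
    """Drop the oldest messages until count_tokens(messages) <= budget."""
    trimmed = list(messages)
    while trimmed and count_tokens(trimmed) > budget:
        trimmed.pop(0)
        # The conversation must still begin with a user turn
        while trimmed and trimmed[0]["role"] != "user":
            trimmed.pop(0)
    if not trimmed:
        raise ValueError("even the most recent message exceeds the budget")
    return trimmed


# Stand-in counter for illustration: one "token" per character.
def fake_count(msgs):
    return sum(len(m["content"]) for m in msgs)


history = [
    {"role": "user", "content": "aaaa"},
    {"role": "assistant", "content": "bbbb"},
    {"role": "user", "content": "cccc"},
    {"role": "assistant", "content": "dddd"},
    {"role": "user", "content": "ee"},
]
print(trim_to_fit(history, fake_count, budget=10))
```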
Streaming with tools
# stream_with_tools is an illustrative wrapper; the caller supplies tools and messages.
async def stream_with_tools(messages, tools):
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=4096,  # required
        tools=tools,
        messages=messages,
    ) as stream:
        async for event in stream:
            if event.type == "content_block_start":
                if event.content_block.type == "tool_use":
                    # Tool call coming
                    pass
            elif event.type == "content_block_delta":
                if event.delta.type == "text_delta":
                    yield event.delta.text
Stream tokens; track tool use as it builds.
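Tool arguments arrive as fragments of JSON in `input_json_delta` events, so tracking a tool call means buffering those fragments per content block until the block stops. The sketch below shows that bookkeeping against stand-in event objects built with `SimpleNamespace`; `collect_tool_calls` is a hypothetical helper. If you only need the finished calls, the SDK's `stream.get_final_message()` returns them already parsed.

```python
import json
from types import SimpleNamespace as NS


def collect_tool_calls(events):
    """Assemble complete tool calls from raw streaming events."""
    pending = {}  # content block index -> partially built call
    calls = []
    for event in events:
        if event.type == "content_block_start":
            if event.content_block.type == "tool_use":
                pending[event.index] = {
                    "id": event.content_block.id,
                    "name": event.content_block.name,
                    "parts": [],
                }
        elif event.type == "content_block_delta":
            if event.delta.type == "input_json_delta" and event.index in pending:
                pending[event.index]["parts"].append(event.delta.partial_json)
        elif event.type == "content_block_stop" and event.index in pending:
            call = pending.pop(event.index)
            raw = "".join(call["parts"])
            # A tool with no arguments streams no JSON fragments at all
            calls.append({
                "id": call["id"],
                "name": call["name"],
                "input": json.loads(raw) if raw else {},
            })
    return calls


# Stand-in events that mimic the shape of the stream (illustrative values only)
demo_events = [
    NS(type="content_block_start", index=0,
       content_block=NS(type="tool_use", id="toolu_demo", name="search_docs")),
    NS(type="content_block_delta", index=0,
       delta=NS(type="input_json_delta", partial_json='{"query": "ref')),
    NS(type="content_block_delta", index=0,
       delta=NS(type="input_json_delta", partial_json='und policy"}')),
    NS(type="content_block_stop", index=0),
]
print(collect_tool_calls(demo_events))
```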
System prompt design
You are <role>. <core capabilities>. <output format>.
When responding:
- Be concise; 2-3 sentences max.
- Use plain language; no jargon.
- If uncertain, say so explicitly.
Tool use:
- Use search_docs for factual questions.
- Use create_ticket for support issues.
Concise. Specific. The model follows clear instructions; vague ones produce vague output.
See Prompt Engineering 2026.
Cost monitoring
async def log_call(resp):
    await metrics.record(
        feature=current_feature(),
        input_tokens=resp.usage.input_tokens,
        cache_read=resp.usage.cache_read_input_tokens,
        cache_create=resp.usage.cache_creation_input_tokens,
        output_tokens=resp.usage.output_tokens,
        cost_usd=compute_cost(resp.usage, resp.model),
    )
Tag every call. Aggregate by feature. Find your top spenders.
See LLM Observability.
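The logging snippet relies on a `compute_cost` helper that the post leaves undefined. Here is one way it might look, with the price table passed in as an extra argument so no rates are hard-coded; that signature differs from the two-argument call above and is an assumption, as is the table layout. The key detail is that `input_tokens` excludes cached tokens, so the four usage fields are priced separately and summed.

```python
from types import SimpleNamespace


def compute_cost(usage, model, price_table):
    """USD cost of one response.

    price_table maps model name -> USD per million tokens for
    "input", "output", "cache_write", and "cache_read".
    """
    prices = price_table[model]

    def tokens(field):
        # Cache fields can be missing or None when caching is not used
        return getattr(usage, field, None) or 0

    total = (
        tokens("input_tokens") * prices["input"]
        + tokens("cache_creation_input_tokens") * prices["cache_write"]
        + tokens("cache_read_input_tokens") * prices["cache_read"]
        + tokens("output_tokens") * prices["output"]
    )
    return total / 1_000_000


# Illustrative rates only; load real ones from your own config.
EXAMPLE_PRICES = {
    "example-model": {"input": 3.0, "output": 15.0, "cache_write": 3.75, "cache_read": 0.30},
}
usage = SimpleNamespace(
    input_tokens=1_000,
    cache_creation_input_tokens=None,
    cache_read_input_tokens=10_000,
    output_tokens=500,
)
print(f"${compute_cost(usage, 'example-model', EXAMPLE_PRICES):.4f}")
```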
Common mistakes
1. No prompt caching
Stable system prompt re-sent at full price every call. Add cache_control.
2. Sonnet for trivial classification
Use Haiku. Several times cheaper per token (check current pricing for the exact ratio); comparable on simple tasks.
3. No streaming for chat UIs
User waits 5s for full response. Stream.
4. Tool use without max iters
Bug → infinite loop → bill explodes. Always cap.
5. Ignoring rate limits
429 → crash; retry storm. tenacity + exponential backoff.
What I’d ship today
For new Claude integrations:
- Prompt caching universally.
- Sonnet 4.6 default; Haiku for trivial.
- Streaming for chat / interactive.
- Tool use with schemas + max iters.
- Batch API for non-realtime.
- Tenacity retries on transient errors.
- Per-feature cost tracking.
- Tracing via Langfuse / OTEL.
Read this next
If you want my Anthropic API production starter (caching, tools, streaming, retries), it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.