If your LLM job doesn’t need to finish in seconds, you’re probably paying double. Batch APIs trade latency for 50% off. This post is the working playbook.
When batch makes sense
- Nightly eval runs.
- Bulk content generation (email subject lines, descriptions for a catalog).
- Classification of historical data.
- Embedding many documents.
- Data enrichment / cleanup.
- Any LLM job where 24h is fine.
When it doesn’t:
- Interactive chat / search.
- Streaming responses needed.
- Sub-minute latency required.
Anthropic batches
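from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# One request per item; custom_id is how you match results back later.
batch = await client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": item.text}],
            },
        }
        for i, item in enumerate(items)
    ]
)
print(batch.id)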
Submit. Poll until done:
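import asyncio

async def wait_for_batch(batch_id):
    # Poll once a minute until the batch reaches its terminal state.
    while True:
        batch = await client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        await asyncio.sleep(60)

batch = await wait_for_batch(batch.id)
# With the async client, results() is awaited first, then iterated.
results = [r async for r in await client.messages.batches.results(batch.id)]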
Each result has the custom_id so you can correlate to your job IDs.
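A minimal sketch of that correlation step: index the results by custom_id, then look each one up from your own job records. The `jobs` mapping is an illustrative assumption, and the results below are stand-in objects shaped like the SDK's entries (a `custom_id` plus a `result`), not real API output.

```python
from types import SimpleNamespace

def index_by_custom_id(results):
    """Map custom_id -> result entry. Results can arrive in any order."""
    indexed = {}
    for r in results:
        if r.custom_id in indexed:
            raise ValueError(f"duplicate custom_id: {r.custom_id}")
        indexed[r.custom_id] = r
    return indexed

# Stand-in results (hypothetical), deliberately out of submission order.
fake_results = [
    SimpleNamespace(custom_id="req-1", result=SimpleNamespace(type="succeeded")),
    SimpleNamespace(custom_id="req-0", result=SimpleNamespace(type="errored")),
]
by_id = index_by_custom_id(fake_results)

# jobs: your own mapping from custom_id to whatever you submitted (assumed).
jobs = {"req-0": "first item", "req-1": "second item"}
for custom_id, job in jobs.items():
    print(job, "->", by_id[custom_id].result.type)
```

Looking results up by ID rather than by position matters because batch results are not guaranteed to come back in submission order.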
OpenAI batches
import json
from openai import AsyncOpenAI

openai = AsyncOpenAI()  # async client; reads OPENAI_API_KEY from the environment

# 1. Create JSONL file: one request per line
with open("batch.jsonl", "w") as f:
    for i, item in enumerate(items):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4.1-mini",
                "messages": [{"role": "user", "content": item.text}],
                "max_tokens": 1024,
            },
        }) + "\n")

# 2. Upload + create
with open("batch.jsonl", "rb") as fh:
    file = await openai.files.create(file=fh, purpose="batch")
batch = await openai.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
Same pattern: poll, retrieve results.
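As a rough sketch of that poll-and-retrieve step on the OpenAI side: wait for a terminal status, download the output file, and parse it line by line. The function names are my own, and `client` is assumed to be an `AsyncOpenAI` instance passed in by the caller. The parser assumes the documented output line shape (`custom_id`, `response`, `error`) for chat completions.

```python
import asyncio
import json

TERMINAL = {"completed", "failed", "expired", "cancelled"}

async def wait_for_openai_batch(client, batch_id, interval=60):
    """Poll until the batch reaches a terminal status."""
    while True:
        batch = await client.batches.retrieve(batch_id)
        if batch.status in TERMINAL:
            return batch
        await asyncio.sleep(interval)

def parse_batch_output(jsonl_text):
    """Split output JSONL into ({custom_id: text}, {custom_id: error})."""
    ok, failed = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response") or {}
        if row.get("error") or response.get("status_code") != 200:
            failed[row["custom_id"]] = row.get("error") or response.get("body")
        else:
            body = response["body"]
            ok[row["custom_id"]] = body["choices"][0]["message"]["content"]
    return ok, failed

async def fetch_results(client, batch):
    """Download and parse the output file, if the batch produced one."""
    if batch.output_file_id is None:
        return {}, {}
    content = await client.files.content(batch.output_file_id)
    return parse_batch_output(content.text)
```

Requests that failed are typically written to a separate error file (`batch.error_file_id`), so a complete processor would read that file too; the parser above handles error rows defensively in case they appear.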
Combine with prompt caching
batch = await client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                # Mark the shared system prompt as cacheable.
                "system": [
                    {"type": "text", "text": HUGE_SYSTEM, "cache_control": {"type": "ephemeral"}}
                ],
                "messages": [{"role": "user", "content": item.text}],
                "max_tokens": 1024,
            },
        }
        for i, item in enumerate(items)
    ]
)
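The system prompt can be served from cache across requests in the batch. Cache hits inside a batch are best-effort, since requests are processed concurrently and in no fixed order, so expect a good hit rate rather than a guaranteed one. The cache discount stacks with batch’s 50% discount, so the savings compound.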
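To put rough numbers on the compounding, here is a back-of-the-envelope calculator for input cost per request. Everything in it is an assumption to check against current pricing: the cache multipliers (1.25x base price to write, 0.1x to read), the placeholder $3 per million input tokens, the token counts, and the 90% hit rate.

```python
CACHE_WRITE_MULT = 1.25  # assumed: cache write costs 1.25x base input price
CACHE_READ_MULT = 0.10   # assumed: cache read costs 0.1x base input price
BATCH_DISCOUNT = 0.5     # the 50% batch discount

def input_cost(system_tokens, user_tokens, price_per_mtok, *, batch, cache_hit_rate=None):
    """Input cost in dollars for one request.

    cache_hit_rate=None means caching is off; otherwise it is the
    fraction of requests that read the system prompt from cache
    (the rest pay the cache-write price).
    """
    if cache_hit_rate is None:
        system_mult = 1.0
    else:
        system_mult = (cache_hit_rate * CACHE_READ_MULT
                       + (1 - cache_hit_rate) * CACHE_WRITE_MULT)
    billable_tokens = system_tokens * system_mult + user_tokens
    cost = billable_tokens * price_per_mtok / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost

# Hypothetical workload: 10k-token system prompt, 500-token user message.
baseline = input_cost(10_000, 500, 3.0, batch=False)
batch_only = input_cost(10_000, 500, 3.0, batch=True)
batch_cached = input_cost(10_000, 500, 3.0, batch=True, cache_hit_rate=0.9)
print(f"online, no cache: ${baseline:.6f}")
print(f"batch only:       ${batch_only:.6f}")
print(f"batch + cache:    ${batch_cached:.6f}")
```

The sketch covers input tokens only; output tokens get the batch discount but are not affected by caching.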
Error handling
Some requests in a batch may fail individually. Each result has a result.type:
# With the async client, results() is awaited first, then iterated.
async for r in await client.messages.batches.results(batch_id):
    if r.result.type == "succeeded":
        process(r.custom_id, r.result.message)
    elif r.result.type == "errored":
        log.error(f"{r.custom_id} failed: {r.result.error}")
        # retry individually or queue for manual review
    elif r.result.type == "canceled":
        log.warning(f"{r.custom_id} canceled")
Plan for partial failures.
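One way to plan for them, as a minimal sketch: partition the results, keep the successes, and rebuild a smaller request list from only the failures. The helper names are hypothetical, and the stand-in objects mimic the result shape used above.

```python
from types import SimpleNamespace

def split_results(results):
    """Partition results into successes and custom_ids worth retrying."""
    succeeded, retry_ids = {}, []
    for r in results:
        if r.result.type == "succeeded":
            succeeded[r.custom_id] = r.result.message
        else:
            # errored, canceled, or expired: none of these produced output
            retry_ids.append(r.custom_id)
    return succeeded, retry_ids

def build_retry_requests(original_requests, retry_ids):
    """Pick the original request dicts whose custom_id needs another attempt."""
    wanted = set(retry_ids)
    return [req for req in original_requests if req["custom_id"] in wanted]

# Stand-in data (hypothetical): three requests, one of which errored.
original = [{"custom_id": f"req-{i}", "params": {}} for i in range(3)]
fake_results = [
    SimpleNamespace(custom_id="req-0", result=SimpleNamespace(type="succeeded", message="a")),
    SimpleNamespace(custom_id="req-1", result=SimpleNamespace(type="errored")),
    SimpleNamespace(custom_id="req-2", result=SimpleNamespace(type="succeeded", message="c")),
]
done, retry_ids = split_results(fake_results)
retry_requests = build_retry_requests(original, retry_ids)
print(len(done), "succeeded;", len(retry_requests), "to retry")
```

Before resubmitting, it is worth inspecting the error itself: an invalid request will fail again no matter how many times it is retried.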
Queueing pattern
For ongoing batch jobs (e.g., enrich every new product description):
async def queue_for_batch(item_id, prompt):
    await db.execute(
        "INSERT INTO llm_batch_queue (item_id, prompt, status) VALUES ($1, $2, 'pending')",
        item_id, prompt
    )

# Hourly job
async def submit_pending_batch():
    pending = await db.fetch(
        "SELECT id, item_id, prompt FROM llm_batch_queue WHERE status = 'pending' LIMIT 50000"
    )
    if not pending:
        return
    # asyncpg-style rows are indexed by column name, not attribute.
    batch = await client.messages.batches.create(
        requests=[{"custom_id": f"row-{p['id']}", "params": {...}} for p in pending]
    )
    await db.execute(
        "UPDATE llm_batch_queue SET status = 'submitted', batch_id = $1 WHERE id = ANY($2)",
        batch.id, [p["id"] for p in pending]
    )
Periodic submitter; periodic poller; periodic processor of results.
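The submitter is shown above; the other two pieces could look roughly like this. It is a sketch under assumptions: `db` and `client` are passed in and behave like the asyncpg-style pool and async Anthropic client used earlier, and the `result` column and the `'done'` / `'failed'` statuses are hypothetical additions to the queue table.

```python
def row_id_from_custom_id(custom_id):
    """Recover the queue row ID from a custom_id like 'row-123'."""
    return int(custom_id.split("-", 1)[1])

async def find_ended_batches(db, client):
    """Poller: check each in-flight batch once, return the IDs that ended."""
    rows = await db.fetch(
        "SELECT DISTINCT batch_id FROM llm_batch_queue WHERE status = 'submitted'"
    )
    ended = []
    for row in rows:
        batch = await client.messages.batches.retrieve(row["batch_id"])
        if batch.processing_status == "ended":
            ended.append(batch.id)
    return ended

async def process_ended_batch(db, client, batch_id):
    """Processor: write each result back to its queue row."""
    results = await client.messages.batches.results(batch_id)
    async for r in results:
        row_id = row_id_from_custom_id(r.custom_id)
        if r.result.type == "succeeded":
            text = r.result.message.content[0].text
            await db.execute(
                "UPDATE llm_batch_queue SET status = 'done', result = $1 WHERE id = $2",
                text, row_id,
            )
        else:
            # Mark as failed rather than re-queueing blindly, so a
            # permanently bad request cannot loop forever.
            await db.execute(
                "UPDATE llm_batch_queue SET status = 'failed' WHERE id = $1",
                row_id,
            )
```

Keeping the poller and processor separate means a crash while writing results does not lose track of which batches have finished; the processor can simply run again.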
Combining batch and online
Common pattern: critical path uses online API; backfill / bulk uses batch.
async def enrich(product):
    if product.urgency == "high":
        return await online_llm(product)
    # otherwise queue for nightly batch
    await queue_for_batch(product.id, build_prompt(product))
If 80% of your traffic is non-urgent, that share gets the 50% discount, which works out to roughly 40% off the total bill; the urgent 20% keeps low latency.
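The arithmetic behind that split, as a tiny sketch. It assumes spend is spread evenly across requests, so a share of traffic equals the same share of spend; the function name is my own.

```python
def blended_savings(batch_share, batch_discount=0.5):
    """Fraction of total spend saved when `batch_share` of spend moves to batch."""
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    return batch_share * batch_discount

# 80% of spend on batch at 50% off
print(f"{blended_savings(0.8):.0%}")  # prints 40%
```

If urgent requests are much larger or smaller than the rest, weight by tokens rather than by request count.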
Limits
- Anthropic: 100k requests or 256MB per batch, whichever is hit first.
- OpenAI: 50k requests per batch, 200MB input file.
- Bedrock: similar; varies by model.
Split larger jobs into multiple batches.
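A minimal sketch of that splitting, respecting both the request-count and size limits. The defaults mirror the Anthropic numbers above, and the byte count is an approximation based on the JSON-serialized request, which may not match exactly how a provider measures size.

```python
import json

def split_into_batches(requests, max_requests=100_000, max_bytes=256 * 1024 * 1024):
    """Yield lists of requests that each stay under both limits."""
    chunk, chunk_bytes = [], 0
    for req in requests:
        # +1 leaves room for a newline or separator per request
        size = len(json.dumps(req).encode("utf-8")) + 1
        if size > max_bytes:
            raise ValueError(f"request {req.get('custom_id')} alone exceeds max_bytes")
        if chunk and (len(chunk) >= max_requests or chunk_bytes + size > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(req)
        chunk_bytes += size
    if chunk:
        yield chunk

# Usage with tiny limits so the splitting is visible.
demo = [{"custom_id": f"req-{i}"} for i in range(5)]
for part in split_into_batches(demo, max_requests=2):
    print([r["custom_id"] for r in part])
```

Leaving some headroom below the published limits is a reasonable precaution, since the estimate here ignores any envelope the API adds.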
Monitoring
Track:
- Submission rate vs queue depth.
- Time to completion (sub-1h common; up to 24h cap).
- Failure rate per batch.
- Cost per batch.
Alert if a batch sits >25h (probably stuck).
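The stuck-batch check can be sketched as a pure function over your own records of in-flight batches. The `(batch_id, submitted_at)` tuple shape is an assumption; in practice it would come from the queue table or wherever you log submissions.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(hours=25)

def find_stuck_batches(in_flight, now=None):
    """Return batch IDs submitted more than STUCK_AFTER ago.

    in_flight: iterable of (batch_id, submitted_at) pairs with
    timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    return [bid for bid, submitted_at in in_flight if now - submitted_at > STUCK_AFTER]

# Usage with a fixed clock (hypothetical batch IDs).
now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
in_flight = [
    ("batch_a", now - timedelta(hours=3)),
    ("batch_b", now - timedelta(hours=26)),
]
print(find_stuck_batches(in_flight, now=now))
```

Passing `now` explicitly keeps the check easy to test and avoids surprises from clock differences between workers.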
Beyond batch APIs
For really high-volume work that doesn’t fit batch APIs:
- Self-hosted vLLM at sustained high load can be cheaper than even the batch API, provided you keep the GPUs busy.
- Fine-tuned smaller model on cheap inference.
Batch APIs are the sweet spot for “moderate volume + latency tolerant” work.
Common mistakes
1. Synchronous online calls for nightly jobs
Say the job takes 8h of synchronous calls and costs $5k. The same job on the batch API: roughly 1h in the queue plus up to 12h until results land, for $2.5k.
2. Tiny batches
Submitting batches of 5 requests defeats the purpose. Aim for hundreds-to-thousands per batch.
3. No correlation IDs
Forget custom_id and you can’t match results to inputs, since results are not guaranteed to come back in submission order. Always set IDs.
4. Polling too often
Polling every 5s for a 6h job. Wasteful. 1–5 min interval is fine.
5. Treating partial failures as total failures
A batch has 1% failures, so the entire batch gets retried. Instead, process the 99% that succeeded and retry only the failures.
What I’d ship today
For a fresh LLM-using app with bulk needs:
- Eval runs: batch.
- Nightly enrichment: batch.
- User-facing chat: online with caching.
- Mixed: queue non-urgent → hourly batch submitter.
- Monitoring on batch state.
- Per-user / per-feature spend caps.
Read this next
If you want my batch API queue + processor reference, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.