If your LLM job doesn’t need to finish in seconds, you’re probably paying double. Batch APIs trade latency for 50% off. This post is the working playbook.

When batch makes sense

  • Nightly eval runs.
  • Bulk content generation (email subject lines, descriptions for a catalog).
  • Classification of historical data.
  • Embedding many documents.
  • Data enrichment / cleanup.
  • Any LLM job where 24h is fine.

When it doesn’t:

  • Interactive chat / search.
  • Streaming responses needed.
  • Sub-minute latency required.

Anthropic batches

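import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Run inside an async function; `items` is your list of inputs.
batch = await client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",  # your correlation key
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": item.text}],
            },
        }
        for i, item in enumerate(items)  # one request per item, whatever the list length
    ]
)
print(batch.id)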

Submit. Poll until done:

async def wait_for_batch(batch_id):
    while True:
        batch = await client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        await asyncio.sleep(60)

batch = await wait_for_batch(batch.id)
# With the async client, results() is a coroutine: await it, then iterate.
results = [r async for r in await client.messages.batches.results(batch.id)]

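Each result carries its custom_id, so you can match it back to your own job IDs. Results are not guaranteed to come back in submission order, so match on custom_id rather than position.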
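A minimal sketch of that correlation step, assuming the req-{i} naming used above. Both helpers are hypothetical (they are not part of the SDK) and only rely on each result exposing a custom_id attribute; the stand-in results are there so the snippet runs without an API key.

```python
from types import SimpleNamespace


def index_by_custom_id(results):
    """Map custom_id -> result, refusing duplicates so mismatches fail loudly."""
    indexed = {}
    for r in results:
        if r.custom_id in indexed:
            raise ValueError(f"duplicate custom_id: {r.custom_id}")
        indexed[r.custom_id] = r
    return indexed


def item_index(custom_id, prefix="req-"):
    """Recover the original list position from an ID such as 'req-42'."""
    if not custom_id.startswith(prefix):
        raise ValueError(f"unexpected custom_id: {custom_id}")
    return int(custom_id[len(prefix):])


# Stand-in results, deliberately out of submission order.
fake_results = [SimpleNamespace(custom_id="req-2"), SimpleNamespace(custom_id="req-0")]
by_id = index_by_custom_id(fake_results)
positions = sorted(item_index(cid) for cid in by_id)
```

In a real pipeline the ID would more likely be a database key than a list index, which survives reordering and resubmission.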

OpenAI batches

import json

# 1. Create JSONL file
with open("batch.jsonl", "w") as f:
    for i, item in enumerate(items):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4.1-mini",
                "messages": [{"role": "user", "content": item.text}],
                "max_tokens": 1024,
            },
        }) + "\n")

# 2. Upload + create (`openai` here is an AsyncOpenAI() client instance, not the module)
with open("batch.jsonl", "rb") as f:
    file = await openai.files.create(file=f, purpose="batch")
batch = await openai.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

Same pattern: poll the batch until it reaches a terminal status, then download the output file and read the results line by line.
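Here is a rough sketch of that poll-and-parse step. The function names are my own, the client is passed in (assumed to be an AsyncOpenAI instance), and the parser assumes the documented output shape where each JSONL line carries custom_id, response and error. Treat it as a starting point, not the official recipe.

```python
import asyncio
import json

# Statuses after which the batch will not change any more.
TERMINAL = {"completed", "failed", "expired", "cancelled"}


async def wait_for_openai_batch(client, batch_id, poll_seconds=60):
    """Poll until the batch reaches a terminal status, then return it."""
    while True:
        batch = await client.batches.retrieve(batch_id)
        if batch.status in TERMINAL:
            return batch
        await asyncio.sleep(poll_seconds)


def parse_batch_output(jsonl_text):
    """Split output JSONL into {custom_id: body} successes and {custom_id: error} failures."""
    ok, failed = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response") or {}
        if row.get("error") or response.get("status_code") != 200:
            failed[row["custom_id"]] = row.get("error") or response.get("body")
        else:
            ok[row["custom_id"]] = response["body"]
    return ok, failed
```

Once the batch is completed, the JSONL text should be retrievable with something like (await client.files.content(batch.output_file_id)).text. Failed requests may land in a separate error file (batch.error_file_id), so check both.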

Combine with prompt caching

batch = await client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                # Identical system block on every request, marked cacheable.
                "system": [
                    {"type": "text", "text": HUGE_SYSTEM, "cache_control": {"type": "ephemeral"}}
                ],
                "messages": [{"role": "user", "content": item.text}],
                "max_tokens": 1024,
            },
        }
        for i, item in enumerate(items)
    ]
)

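The system prompt is cached and reused across requests in the batch. Because batch requests run concurrently and in no fixed order, cache hits are best-effort rather than guaranteed. The cache discount stacks with batch’s 50% discount, so the combined savings can be significant.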
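To get a feel for how the two discounts compound, here is a back-of-the-envelope estimator for input cost only. Everything in it is an assumption to check against current pricing: the cache-write multiplier (1.25x) and cache-read multiplier (0.1x) reflect Anthropic’s published rates as I understand them, the price per million tokens is a placeholder, and the hit rate is whatever you actually observe.

```python
def estimate_input_cost(
    n_requests,
    system_tokens,
    user_tokens,
    price_per_mtok,         # base input price in $ per million tokens (placeholder)
    batch=False,
    cache_hit_rate=0.0,     # fraction of requests that read the system prompt from cache
    cache_write_mult=1.25,  # assumed cache-write multiplier
    cache_read_mult=0.10,   # assumed cache-read multiplier
    batch_mult=0.50,        # the batch discount discussed in this post
):
    """Rough input-token cost in dollars; output tokens are ignored."""
    hits = n_requests * cache_hit_rate
    misses = n_requests - hits
    if cache_hit_rate > 0:
        # Misses pay the write price for the system prompt; hits pay the read price.
        system_units = system_tokens * (misses * cache_write_mult + hits * cache_read_mult)
    else:
        # Caching off: every request pays full price for the system prompt.
        system_units = system_tokens * n_requests
    user_units = user_tokens * n_requests
    cost = (system_units + user_units) * price_per_mtok / 1_000_000
    return cost * (batch_mult if batch else 1.0)
```

With placeholder numbers (10,000 requests, a 5,000-token system prompt, 200-token user messages, $3 per million input tokens) this gives about $156 online, $78 on batch, and roughly $19 on batch with a 90% cache hit rate. Your real hit rate will be lower if requests are spread out beyond the cache lifetime.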

Error handling

Some requests in a batch may fail individually. Each result has a result.type:

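# With the async client, await results() before iterating.
async for r in await client.messages.batches.results(batch_id):
    if r.result.type == "succeeded":
        process(r.custom_id, r.result.message)
    elif r.result.type == "errored":
        log.error(f"{r.custom_id} failed: {r.result.error}")
        # retry individually or queue for manual review
    elif r.result.type == "canceled":
        log.warning(f"{r.custom_id} canceled")
    elif r.result.type == "expired":
        # the batch hit its 24h window before this request ran; resubmit it
        log.warning(f"{r.custom_id} expired")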

Plan for partial failures.
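One way to plan for them, sketched with hypothetical helpers: sort results into successes, requests that never ran (canceled, plus the expired type the API can also report), and genuine errors, then rebuild a retry batch from the original request dicts. It assumes you kept the list of requests you submitted.

```python
def partition_results(results):
    """Return (succeeded, retry_ids, failed_ids) from an iterable of batch results."""
    succeeded, retry_ids, failed_ids = {}, [], []
    for r in results:
        kind = r.result.type
        if kind == "succeeded":
            succeeded[r.custom_id] = r.result.message
        elif kind in ("expired", "canceled"):
            retry_ids.append(r.custom_id)   # never ran, safe to resubmit as-is
        else:
            failed_ids.append(r.custom_id)  # errored: inspect before retrying
    return succeeded, retry_ids, failed_ids


def build_retry_requests(original_requests, retry_ids):
    """Pick the original request dicts whose custom_id needs another attempt."""
    wanted = set(retry_ids)
    return [req for req in original_requests if req["custom_id"] in wanted]
```

Errored requests are kept apart on purpose: an invalid request will fail again however often you resubmit it, so those are better logged and reviewed than blindly retried.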

Queueing pattern

For ongoing batch jobs (e.g., enrich every new product description):

async def queue_for_batch(item_id, prompt):
    await db.execute(
        "INSERT INTO llm_batch_queue (item_id, prompt, status) VALUES ($1, $2, 'pending')",
        item_id, prompt
    )

# Hourly job
async def submit_pending_batch():
    pending = await db.fetch(
        "SELECT id, item_id, prompt FROM llm_batch_queue WHERE status = 'pending' LIMIT 50000"
    )
    if not pending:
        return

    # Rows are indexed by column name (asyncpg-style records have no attribute access).
    batch = await client.messages.batches.create(
        requests=[{"custom_id": f"row-{p['id']}", "params": {...}} for p in pending]
    )

    # Record which batch each row went into so the poller can find it later.
    await db.execute(
        "UPDATE llm_batch_queue SET status = 'submitted', batch_id = $1 WHERE id = ANY($2)",
        batch.id, [p["id"] for p in pending]
    )

Three periodic jobs make up the pipeline: a submitter, a poller that checks submitted batches, and a processor that writes results back.
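The submitter is shown above; a poller might look roughly like this. It is a sketch under assumptions: db and client are passed in, db behaves like an asyncpg connection, the table and row-{id} naming come from the submitter, and the 'done' / 'failed' status values are made up for illustration. For brevity it folds the poller and the result processor into one job.

```python
import asyncio


async def poll_submitted_batches(db, client):
    """Check every submitted batch once; record outcomes for the ones that have ended."""
    rows = await db.fetch(
        "SELECT DISTINCT batch_id FROM llm_batch_queue WHERE status = 'submitted'"
    )
    for row in rows:
        batch_id = row["batch_id"]
        batch = await client.messages.batches.retrieve(batch_id)
        if batch.processing_status != "ended":
            continue  # still running; look again on the next tick
        async for r in await client.messages.batches.results(batch_id):
            row_id = int(r.custom_id.removeprefix("row-"))
            status = "done" if r.result.type == "succeeded" else "failed"
            await db.execute(
                "UPDATE llm_batch_queue SET status = $1 WHERE id = $2",
                status, row_id,
            )


async def run_poller(db, client, interval_seconds=300):
    """Run the poller forever, sleeping between passes."""
    while True:
        await poll_submitted_batches(db, client)
        await asyncio.sleep(interval_seconds)
```

A real version would also store the model output somewhere and distinguish expired rows (safe to requeue as 'pending') from errored ones.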

Combining batch and online

Common pattern: critical path uses online API; backfill / bulk uses batch.

async def enrich(product):
    if product.urgency == "high":
        return await online_llm(product)
    
    # otherwise queue for nightly batch
    await queue_for_batch(product.id, build_prompt(product))

If, say, 80% of your traffic is non-urgent, that share gets the 50% discount while the urgent 20% keeps low latency.
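The arithmetic behind that split is simple enough to write down. This tiny helper is illustrative only and assumes urgent and non-urgent requests cost the same on average, so a share of traffic is also a share of spend.

```python
def blended_savings(batch_share, batch_discount=0.5):
    """Fraction of total spend saved when `batch_share` of spend moves to batch."""
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    return batch_share * batch_discount


# 80% of spend on batch at 50% off.
overall = blended_savings(0.8)
```

So an 80/20 split cuts the total bill by about 40%, not 50%. The urgent path still pays full price.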

Limits

  • Anthropic: 100k requests or 256MB per batch, whichever limit is hit first.
  • OpenAI: 50k requests per batch, 200MB input file.
  • Bedrock: similar; varies by model.

Split larger jobs into multiple batches.
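A possible splitter, written as a generator so a large job never has to sit in memory twice. The defaults mirror the Anthropic numbers above; measuring size as serialized JSON bytes is my approximation of how the provider counts, so leave some headroom under the real cap.

```python
import json


def split_into_batches(requests, max_requests=100_000, max_bytes=256 * 1024 * 1024):
    """Yield lists of requests that each stay under both the count and size caps."""
    current, current_bytes = [], 0
    for req in requests:
        size = len(json.dumps(req).encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"request {req.get('custom_id')} exceeds max_bytes on its own")
        # Close the current chunk if adding this request would break either cap.
        if current and (len(current) >= max_requests or current_bytes + size > max_bytes):
            yield current
            current, current_bytes = [], 0
        current.append(req)
        current_bytes += size
    if current:
        yield current
```

Each yielded list can be passed straight to the batch create call as its requests argument.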

Monitoring

Track:

  • Submission rate vs queue depth.
  • Time to completion (sub-1h common; up to 24h cap).
  • Failure rate per batch.
  • Cost per batch.

Alert if a batch sits >25h (probably stuck).
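The stuck-batch check can be a few lines. This sketch assumes you record a timezone-aware submission time per batch yourself (for example alongside batch_id in the queue table); the function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_batches(open_batches, now=None, max_age=timedelta(hours=25)):
    """Return IDs of batches submitted more than `max_age` ago that have not ended.

    `open_batches` maps batch_id -> timezone-aware submission time and should
    only contain batches that are still open.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        batch_id
        for batch_id, submitted_at in open_batches.items()
        if now - submitted_at > max_age
    )
```

Run it from the same scheduler as the poller and page someone if the returned list is not empty.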

Beyond batch APIs

For really high-volume work that doesn’t fit batch APIs:

  • Self-hosted vLLM at sustained load can be cheaper than even the batch API, provided you keep the GPUs busy.
  • Fine-tuned smaller model on cheap inference.

Batch APIs are the sweet spot for “moderate volume + latency tolerant” work.

Common mistakes

1. Synchronous online calls for nightly jobs

A nightly job run through the online API takes 8h and costs $5k. The same job on the batch API takes longer end to end (say 1h queued plus 12h to deliver) but costs $2.5k.

2. Tiny batches

Submitting batches of 5 requests defeats the purpose. Aim for hundreds-to-thousands per batch.

3. No correlation IDs

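Use throwaway custom_id values and you can’t match results to inputs, since results aren’t guaranteed to come back in submission order. Always set IDs that map to your own records.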

4. Polling too often

Polling every 5s for a 6h job. Wasteful. 1–5 min interval is fine.
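If you want something between a fixed interval and hammering the API, a capped backoff works. This is one possible schedule, not a recommendation from either provider; the start, factor and cap values are arbitrary choices within the 1–5 min range.

```python
import itertools


def poll_delays(start=60.0, factor=1.5, cap=300.0):
    """Yield sleep durations that grow from `start` up to `cap` seconds."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First few waits: 60s, 90s, 135s, 202.5s, then 300s from there on.
first_five = list(itertools.islice(poll_delays(), 5))
```

Inside a polling loop you would pull the next value with next(delays) and pass it to asyncio.sleep.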

5. Treating partial failures as total failures

A batch comes back with 1% failures and the whole batch gets retried. Process the 99% that succeeded; retry only the failures.

What I’d ship today

For a fresh LLM-using app with bulk needs:

  • Eval runs: batch.
  • Nightly enrichment: batch.
  • User-facing chat: online with caching.
  • Mixed: queue non-urgent → hourly batch submitter.
  • Monitoring on batch state.
  • Per-user / per-feature spend caps.

Read this next

If you want my batch API queue + processor reference, it’s at rajpoot.dev.

