Pulling structured data out of PDFs and images used to require a layout model + OCR + post-processing. In 2026, vision LLMs do it in one call. This post walks through the working pattern.
The stack
| Approach | Cost / page | Best for |
|---|---|---|
| Vision LLM (Claude, GPT-4o, Gemini) | $0.005–0.05 | Variable layouts, mixed content |
| Document Intelligence (Azure, AWS Textract) | $0.005–0.05 | Tables, forms, structured docs |
| Classical OCR + LLM (Tesseract → LLM) | $0.001–0.01 | Stable layouts, high volume |
| Layout-aware (LayoutLMv3, Donut) | self-host | Specialized; needs training |
For most 2026 products, a vision LLM with structured output is the path of least resistance.
Vision LLM extraction
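```python
import base64
from datetime import date
from decimal import Decimal

import anthropic
from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal


class Invoice(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor_name: str
    customer_name: str
    line_items: list[LineItem]
    subtotal: Decimal
    tax: Decimal | None = None
    total: Decimal


client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder path: any PNG of a single invoice page.
with open("invoice.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("ascii")

# Anthropic vision call with tool calling for structured output.
# Forcing the tool means the reply arrives as tool input shaped by the schema.
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2000,
    tools=[{
        "name": "extract_invoice",
        "description": "Record the fields extracted from an invoice image.",
        "input_schema": Invoice.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract the invoice."},
        ],
    }],
)

# The forced tool call comes back as a tool_use content block.
block = next(b for b in resp.content if b.type == "tool_use")
# Pydantic validates types here; a ValidationError means the extraction is unusable.
invoice = Invoice.model_validate(block.input)
```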
For structured output mechanics, see Structured Output for LLMs.
PDF handling
Some PDFs are just scanned images; many others carry a selectable text layer. Strategy:
- Extract text layer first (PyMuPDF / pdfplumber). Fast, cheap, works for native PDFs.
- Render to image if no text or layout matters. Send to vision LLM.
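```python
import fitz  # PyMuPDF

# process_text_layer and process_with_vision_llm are your own handlers.
doc = fitz.open(pdf_path)
for page in doc:
    text = page.get_text()
    if text.strip():
        # Native PDF page: use the embedded text layer.
        process_text_layer(text)
    else:
        # Scanned page: render at 200 DPI and send the PNG bytes to the vision model.
        img = page.get_pixmap(dpi=200).tobytes("png")
        process_with_vision_llm(img)
```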
Multi-page documents
A 30-page contract:
- Per-page: extract from each page; merge results. Loses cross-page context (continued tables).
- Whole doc: send all pages at once. Eats input tokens; long-context model required. See 1M-Token Context Windows.
- Chunked: chunks of N pages with overlap; merge.
For invoices: per-page works. For contracts: whole-doc or chunked.
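The chunked option can be sketched as a small helper that produces overlapping page windows. This is a minimal illustration, not a fixed recipe: `chunk_pages` is a hypothetical name, and the default chunk size and overlap are assumptions you would tune per document type.

```python
def chunk_pages(num_pages: int, size: int = 10, overlap: int = 2) -> list[range]:
    """Split page indices 0..num_pages-1 into windows that share `overlap` pages."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < num_pages:
        end = min(start + size, num_pages)
        chunks.append(range(start, end))
        if end == num_pages:
            break  # last window reached the end of the document
        start += step
    return chunks


# A 30-page contract yields four windows of page indices.
for pages in chunk_pages(30):
    print(pages.start, pages.stop)
```

The overlap is what lets a table that continues across a page boundary appear whole in at least one chunk. Merging is left out on purpose: deduplicating rows that show up in two overlapping chunks depends on the document.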
Cost discipline
For 1000 invoices/day at $0.02/page:
- Single-page: $20/day = $600/month.
- Multi-page (avg 3): $1800/month.
Batching (Anthropic's Batch API is 50% off) cuts that in half if you don't need real-time results.
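The arithmetic above is easy to keep in a helper so you can plug in your own volumes. A minimal sketch: `monthly_cost` is a hypothetical name, and the 30-day month is an assumption that matches the figures above.

```python
from decimal import Decimal


def monthly_cost(docs_per_day: int, pages_per_doc: int, cost_per_page: Decimal,
                 days: int = 30, batch_discount: Decimal = Decimal("0")) -> Decimal:
    """Estimate monthly spend; batch_discount is a fraction, e.g. 0.5 for 50% off."""
    daily = docs_per_day * pages_per_doc * cost_per_page
    return daily * days * (1 - batch_discount)


rate = Decimal("0.02")
print(monthly_cost(1000, 1, rate))                                 # single-page invoices
print(monthly_cost(1000, 3, rate))                                 # three pages on average
print(monthly_cost(1000, 3, rate, batch_discount=Decimal("0.5")))  # same, batched
```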
For very high volume: fine-tune a smaller model on your invoice corpus. See Fine-Tuning vs RAG vs Prompting.
Validation and review
Vision LLMs are good but not perfect. Always:
- Validate critical fields server-side (totals reconcile; dates parse; required fields present).
- Reconciliation check: if the extracted total doesn’t match the line items (quantity × unit price, summed, plus tax), flag for review.
- Human review queue for low-confidence extractions.
- Audit trail of every extraction.
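The totals check from the list above might look like this. It is a sketch under assumptions: `Line` is a stand-in for the `LineItem` model, `reconcile` and the one-cent `TOLERANCE` are illustrative, and it assumes invoices with no discounts or shipping lines.

```python
from dataclasses import dataclass
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed rounding tolerance; adjust per currency


@dataclass
class Line:  # stand-in for the LineItem model, so this runs without pydantic
    quantity: int
    unit_price: Decimal
    total: Decimal


def reconcile(line_items, subtotal, tax, total) -> list[str]:
    """Return a list of problems; an empty list means the numbers reconcile."""
    problems = []
    for i, item in enumerate(line_items):
        if abs(item.quantity * item.unit_price - item.total) > TOLERANCE:
            problems.append(f"line {i}: quantity x unit_price != total")
    line_sum = sum((item.total for item in line_items), Decimal("0"))
    if abs(line_sum - subtotal) > TOLERANCE:
        problems.append("subtotal != sum of line totals")
    if abs(subtotal + (tax or Decimal("0")) - total) > TOLERANCE:
        problems.append("total != subtotal + tax")
    return problems


items = [Line(2, Decimal("10.00"), Decimal("20.00")),
         Line(1, Decimal("5.50"), Decimal("5.50"))]
problems = reconcile(items, Decimal("25.50"), Decimal("2.55"), Decimal("28.05"))
print("auto-process" if not problems else f"review queue: {problems}")
```

Returning the list of problems rather than a boolean gives the reviewer a reason for the flag, and the same list can go into the audit trail.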
Common patterns
Invoice processing pipeline
Email / upload
↓
Extract attachments / decode
↓
Vision LLM extract structured
↓
Validate (totals, format)
↓
Auto-process, or route to the review queue
↓
Update accounting system
Contract analysis
PDF
↓
Text + image extraction
↓
Long-context LLM (Opus 4.7 / Gemini 2.5 Pro)
↓
Multi-question extraction (parties, dates, key clauses, risks)
↓
Persist + show summary to user
Receipt scanning
Typically a mobile-app feature. The LLM call is heavyweight, so cache results by image hash in case users re-upload the same photo.
Common mistakes
1. No validation
The LLM extracts, you trust it, the total is wrong, and you bill the customer the wrong amount.
2. Throwing big PDFs whole at the LLM
A 100-page PDF in context costs more than necessary if you only need one section. Chunk smartly.
3. No fallback for failures
The LLM call times out or errors. Either retry, fall back to manual review, or fail the upload gracefully. Don’t show the user a stack trace.
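One way to wire up retry plus fallback is sketched below. `extract_with_fallback` is a hypothetical helper, and the attempt count and backoff delays are assumptions. It catches `Exception` broadly for brevity; in real code you would narrow that to your client's timeout and rate-limit errors.

```python
import time
from typing import Any, Callable


def extract_with_fallback(call: Callable[[], Any], attempts: int = 3,
                          base_delay: float = 1.0,
                          sleep: Callable[[float], None] = time.sleep) -> tuple[str, Any]:
    """Return ("ok", result), or ("manual_review", last_error) once retries run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return "ok", call()
        except Exception as exc:  # narrow this to transient errors in production
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, ...
    return "manual_review", last_error


def always_times_out():
    raise TimeoutError("upstream timeout")

status, detail = extract_with_fallback(always_times_out, sleep=lambda s: None)
print(status)  # route the document to the review queue instead of raising
```

The caller gets a status it can act on, so the upload handler can show a friendly "we're reviewing this" message while the error itself goes to logs.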
4. Assuming one shot is right
Run evals on your document corpus. Tune the prompt; iterate.
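A basic eval can be per-field accuracy against a small hand-labeled set. This is a sketch with assumptions: `field_accuracy` is a hypothetical name, extractions and labels are plain dicts, and exact equality is the match rule (you may want normalization for dates and amounts).

```python
def field_accuracy(predictions: list[dict], labels: list[dict],
                   fields: list[str]) -> dict[str, float]:
    """Fraction of documents where each field exactly matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        raise ValueError("need at least one labeled document")
    scores = {}
    for field in fields:
        # A field absent on both sides counts as a match (e.g. no tax line).
        hits = sum(1 for pred, gold in zip(predictions, labels)
                   if pred.get(field) == gold.get(field))
        scores[field] = hits / len(labels)
    return scores


preds = [{"total": "10.00", "vendor_name": "A"}, {"total": "12.00", "vendor_name": "B"}]
golds = [{"total": "10.00", "vendor_name": "A"}, {"total": "21.00", "vendor_name": "B"}]
print(field_accuracy(preds, golds, ["total", "vendor_name"]))
```

Tracking scores per field rather than per document shows which fields a prompt change helped or hurt.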
5. Logging documents with PII
Receipts and contracts have PII. Redact before logging or use private observability.
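Redacting before logging can start with a couple of regexes. The patterns below are illustrative assumptions, not a complete PII detector: they cover email addresses and long digit runs (card, account, or phone numbers) and will miss names and addresses.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Eight or more digits, optionally separated by single spaces or hyphens.
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){7,}\b")


def redact(text: str) -> str:
    """Mask emails and long digit runs before the text reaches a log line."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


print(redact("Contact jane.doe@example.com, card 4111 1111 1111 1111"))
```

Short amounts such as `42.00` pass through untouched, so log lines stay useful for debugging totals.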
Read this next
- Structured Output for LLMs
- Fine-Tuning vs RAG vs Prompting
- LLM Cost Optimization in 2026
- Anthropic Claude API + Tool Use Guide
If you want my invoice / receipt extraction pipeline, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.