Pulling structured data out of PDFs and images used to require a layout model + OCR + post-processing. In 2026, vision LLMs do it in one call. This post is the working pattern.

The stack

| Approach | Cost / page | Best for |
|---|---|---|
| Vision LLM (Claude, GPT-4o, Gemini) | $0.005–0.05 | Variable layouts, mixed content |
| Document Intelligence (Azure, AWS Textract) | $0.005–0.05 | Tables, forms, structured docs |
| Classical OCR + LLM (Tesseract → LLM) | $0.001–0.01 | Stable layouts, high volume |
| Layout-aware (LayoutLMv3, Donut) | self-host | Specialized; needs training |

For most products in 2026, a vision LLM with structured output is the path of least resistance.

Vision LLM extraction

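import base64
from datetime import date
from decimal import Decimal

import anthropic
from pydantic import BaseModel

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# "invoice.png" is a placeholder path; load your own image here.
with open("invoice.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode("utf-8")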

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal

class Invoice(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor_name: str
    customer_name: str
    line_items: list[LineItem]
    subtotal: Decimal
    tax: Decimal | None = None
    total: Decimal

# Anthropic vision call with tool calling for structured output
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2000,
    tools=[{
        "name": "extract_invoice",
        "input_schema": Invoice.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract the invoice."},
        ],
    }],
)

block = next(b for b in resp.content if b.type == "tool_use")
invoice = Invoice.model_validate(block.input)


PDF handling

PDFs are sometimes scanned images, but more often they carry a selectable text layer. Strategy:

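  1. Extract the text layer first (PyMuPDF / pdfplumber). Fast, cheap, and works for native PDFs.
  2. Render the page to an image if there is no text layer or if layout matters, then send it to the vision LLM.

import fitz  # PyMuPDF

# process_text_layer and process_with_vision_llm are your own handlers.
with fitz.open(pdf_path) as doc:
    for page in doc:
        text = page.get_text()
        if text.strip():
            # Native PDF page: the text layer is usable as-is.
            process_text_layer(text)
        else:
            # Scanned page: render to PNG bytes at 200 DPI for the vision LLM.
            img = page.get_pixmap(dpi=200).tobytes("png")
            process_with_vision_llm(img)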

Multi-page documents

For a 30-page contract, there are three options:

  • Per-page: extract from each page; merge results. Loses cross-page context (continued tables).
  • Whole doc: send all pages at once. Eats input tokens; long-context model required. See 1M-Token Context Windows.
  • Chunked: chunks of N pages with overlap; merge.

For invoices: per-page works. For contracts: whole-doc or chunked.
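The chunked option is mostly index bookkeeping. Here is a minimal sketch, assuming pages are addressed by zero-based index; `chunk_pages`, `chunk_size`, and `overlap` are names made up for this example, not part of any library.

```python
def chunk_pages(num_pages: int, chunk_size: int, overlap: int) -> list[range]:
    """Split page indices 0..num_pages-1 into windows that share `overlap` pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    chunks: list[range] = []
    step = chunk_size - overlap  # how far each new window advances
    start = 0
    while start < num_pages:
        end = min(start + chunk_size, num_pages)
        chunks.append(range(start, end))
        if end == num_pages:
            break  # the last window already reaches the final page
        start += step
    return chunks


# A 30-page contract in windows of 10 pages with 2 pages of overlap.
for pages in chunk_pages(30, chunk_size=10, overlap=2):
    print(pages.start, pages.stop - 1)
```

The overlap is what lets a table that continues across a page boundary appear whole in at least one chunk. The merge step then has to deduplicate whatever was extracted twice from the shared pages.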

Cost discipline

For 1000 invoices/day at $0.02/page:

  • Single-page: $20/day = $600/month.
  • Multi-page (avg 3): $1800/month.

Batching (the Anthropic batch API is 50% off) cuts that in half if you don't need results in real time.
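The arithmetic above is easy to parameterize for your own volumes. A small sketch, assuming a 30-day month and a flat per-page price; `monthly_cost` and its parameters are illustrative names, and the prices are the example figures from this section, not quoted rates.

```python
from decimal import Decimal


def monthly_cost(
    docs_per_day: int,
    pages_per_doc: int,
    price_per_page: Decimal,
    days: int = 30,
    batch_discount: Decimal = Decimal("0"),
) -> Decimal:
    """Estimated monthly spend; batch_discount is a fraction such as Decimal("0.5")."""
    pages = docs_per_day * pages_per_doc * days
    return pages * price_per_page * (Decimal("1") - batch_discount)


price = Decimal("0.02")
print(monthly_cost(1000, 1, price))                                  # single-page invoices
print(monthly_cost(1000, 3, price))                                  # three pages on average
print(monthly_cost(1000, 3, price, batch_discount=Decimal("0.5")))   # same, batched
```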

For very high volume: fine-tune a smaller model on your invoice corpus. See Fine-Tuning vs RAG vs Prompting.

Validation and review

Vision LLMs are good but not perfect. Always:

  • Validate critical fields server-side (totals reconcile; dates parse; required fields present).
  • Confidence threshold: if extracted total doesn’t match line items × prices, flag for review.
  • Human review queue for low-confidence.
  • Audit trail of every extraction.
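The reconciliation check can be sketched like this. To keep the example free of dependencies it uses a plain dataclass as a stand-in for the `LineItem` model; `Line`, `reconcile`, and the one-cent `tolerance` are assumptions for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Line:
    quantity: int
    unit_price: Decimal
    total: Decimal


def reconcile(
    lines: list[Line],
    subtotal: Decimal,
    tax: Optional[Decimal],
    total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> list[str]:
    """Return a list of arithmetic inconsistencies; an empty list means the invoice adds up."""
    problems: list[str] = []
    for i, line in enumerate(lines):
        if abs(line.quantity * line.unit_price - line.total) > tolerance:
            problems.append(f"line {i}: quantity x unit_price != total")
    line_sum = sum((line.total for line in lines), Decimal("0"))
    if abs(line_sum - subtotal) > tolerance:
        problems.append("line totals != subtotal")
    if abs(subtotal + (tax or Decimal("0")) - total) > tolerance:
        problems.append("subtotal + tax != total")
    return problems


lines = [Line(2, Decimal("10.00"), Decimal("20.00")), Line(1, Decimal("5.50"), Decimal("5.50"))]
issues = reconcile(lines, Decimal("25.50"), Decimal("2.04"), Decimal("30.00"))
print(issues or "auto-process")  # a non-empty list goes to the review queue
```

Real invoices also carry discounts, shipping, and rounding rules, so treat a failed check as a reason for human review rather than proof the extraction is wrong.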

Common patterns

Invoice processing pipeline

Email / upload
  ↓
Extract attachments / decode
  ↓
Vision LLM extract structured
  ↓
Validate (totals, format)
  ↓
Either auto-process or review queue
  ↓
Update accounting system

Contract analysis

PDF
  ↓
Text + image extraction
  ↓
Long-context LLM (Opus 4.7 / Gemini 2.5 Pro)
  ↓
Multi-question extraction (parties, dates, key clauses, risks)
  ↓
Persist + show summary to user

Receipt scanning

Typical for mobile apps. The LLM call is heavyweight, so cache results by image hash in case users re-upload the same image.
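Caching by image hash might look like the sketch below. It assumes an in-memory dict and a caller-supplied `extract` function (the real vision LLM call); `ExtractionCache` is a hypothetical helper, and in production the store would more likely be Redis or a database table.

```python
import hashlib
from typing import Callable


class ExtractionCache:
    """Memoize extraction results keyed by the SHA-256 of the raw image bytes."""

    def __init__(self, extract: Callable[[bytes], dict]):
        self._extract = extract
        self._store: dict[str, dict] = {}

    def get(self, image_bytes: bytes) -> dict:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Cache miss: pay for the LLM call once and remember the result.
            self._store[key] = self._extract(image_bytes)
        return self._store[key]


def fake_extract(image_bytes: bytes) -> dict:
    # Stand-in for the vision LLM call.
    return {"size": len(image_bytes)}


cache = ExtractionCache(fake_extract)
cache.get(b"same-bytes")
cache.get(b"same-bytes")  # served from the cache, no second call
```

A byte-level hash only matches exact re-uploads of the same file. A second photo of the same receipt produces different bytes and will miss the cache.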

Common mistakes

1. No validation

The LLM extracts, you trust it, the total is wrong, and you bill the customer the wrong amount.

2. Throwing big PDFs whole at the LLM

A 100-page PDF in context costs more than necessary if you only need one section. Chunk smartly.
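One cheap way to chunk smartly is to use the text layer to pick pages before any LLM call. This is a rough sketch, assuming you already have one string of text per page; `select_pages`, `keywords`, and `context` are illustrative names, and plain keyword matching is a deliberately simple stand-in for whatever relevance filter fits your documents.

```python
def select_pages(page_texts: list[str], keywords: list[str], context: int = 1) -> list[int]:
    """Return sorted indices of pages mentioning any keyword, plus `context` neighbours."""
    lowered = [k.lower() for k in keywords]
    selected: set[int] = set()
    for i, text in enumerate(page_texts):
        haystack = text.lower()
        if any(k in haystack for k in lowered):
            first = max(0, i - context)
            last = min(len(page_texts) - 1, i + context)
            selected.update(range(first, last + 1))
    return sorted(selected)


pages = ["Intro", "Definitions", "Termination clause: either party may...", "Signatures", "Appendix"]
print(select_pages(pages, ["termination"]))
```

This only works for pages with a text layer. Scanned pages would need OCR first, or they have to be sent as images regardless.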

3. No fallback for failures

LLM times out / errors. Either retry, fall back to manual review, or fail the upload gracefully. Don’t show the user a stack trace.
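A retry-then-fallback wrapper can be as small as this. It is a sketch under assumptions: `extract` is any zero-argument callable wrapping your LLM call, `extract_with_retry` is a made-up name, and it catches every `Exception` for brevity where real code should catch only the timeout and API error types of the client it uses.

```python
import time
from typing import Callable, Optional


def extract_with_retry(
    extract: Callable[[], dict],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call extract() up to `attempts` times with exponential backoff.

    Returns None when every attempt fails, so the caller can route the
    document to manual review instead of surfacing an error to the user.
    """
    for attempt in range(attempts):
        try:
            return extract()
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None


def always_times_out() -> dict:
    raise TimeoutError("simulated timeout")


result = extract_with_retry(always_times_out, attempts=2, base_delay=0.01)
print("send to manual review" if result is None else result)
```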

4. Assuming one shot is right

Run evals on your document corpus. Tune the prompt; iterate.
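A first eval can be plain per-field exact-match accuracy against hand-labeled documents. A minimal sketch, assuming both predictions and labels are dicts keyed by field name; `field_accuracy` is a hypothetical helper, and exact match is stricter than you may want for free-text fields like vendor names.

```python
def field_accuracy(
    predictions: list[dict], expected: list[dict], fields: list[str]
) -> dict[str, float]:
    """Fraction of documents where each field exactly matches the label."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        raise ValueError("need at least one labeled document")
    scores: dict[str, float] = {}
    for field in fields:
        correct = sum(
            1 for pred, gold in zip(predictions, expected) if pred.get(field) == gold.get(field)
        )
        scores[field] = correct / len(expected)
    return scores


preds = [{"total": "10.00", "vendor_name": "Acme"}, {"total": "12.00", "vendor_name": "Acme"}]
gold = [{"total": "10.00", "vendor_name": "Acme"}, {"total": "21.00", "vendor_name": "Acme"}]
print(field_accuracy(preds, gold, ["total", "vendor_name"]))
```

Tracking the score per field shows where a prompt change helped or hurt, which a single overall number hides.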

5. Logging documents with PII

Receipts and contracts have PII. Redact before logging or use private observability.
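Redacting before logging can start with a couple of regular expressions. The patterns below are intentionally crude assumptions that cover only emails and phone-like digit runs; they will miss names and addresses and may over-match other long numbers, so read this as a starting point rather than a complete PII filter.

```python
import re

# Rough patterns, not validators: when in doubt they redact too much.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask emails and phone-like numbers before the text reaches a log line."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +1 (555) 123-4567"))
```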

Read this next

If you want my invoice / receipt extraction pipeline, it’s at rajpoot.dev.

