Vision LLMs or classical OCR + structured extraction?

Use vision LLMs for variable-layout documents (invoices, receipts, contracts): they handle layout, language, and structure together. Use classical OCR plus extraction when document layouts are stable and you need maximum cost efficiency at scale.

What's the accuracy on invoices?

Frontier vision LLMs hit 95%+ field accuracy on typical invoices with structured-output prompting. The remaining ~5% is mostly edge cases (handwriting, unusual layouts, multi-page documents) that need eval-driven iteration.

Document AI in 2026 — Extracting Structured Data from PDFs and Images

Pulling structured data out of PDFs and images used to require a layout model, OCR, and post-processing. In 2026, vision LLMs do it in one call. This post walks through the working pattern.

The stack

Approach	Cost / page	Best for
Vision LLM (Claude, GPT-4o, Gemini)	$0.005–0.05	Variable layouts, mixed content
Document Intelligence (Azure, AWS Textract)	$0.005–0.05	Tables, forms, structured docs
Classical OCR + LLM (Tesseract → LLM)	$0.001–0.01	Stable layouts, high volume
Layout-aware (LayoutLMv3, Donut)	self-host	Specialized; needs training

For most 2026 products, a vision LLM with structured output is the path of least resistance.

Vision LLM extraction

from pydantic import BaseModel, Field
from datetime import date
from decimal import Decimal

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: Decimal
    total: Decimal

class Invoice(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor_name: str
    customer_name: str
    line_items: list[LineItem]
    subtotal: Decimal
    tax: Decimal | None = None
    total: Decimal

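# Anthropic vision call with tool calling for structured output.
# Assumes ANTHROPIC_API_KEY is set and img_b64 holds the base64-encoded PNG of the page.
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(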
    model="claude-sonnet-4-6",
    max_tokens=2000,
    tools=[{
        "name": "extract_invoice",
        "input_schema": Invoice.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract the invoice."},
        ],
    }],
)

block = next(b for b in resp.content if b.type == "tool_use")
invoice = Invoice.model_validate(block.input)


PDF handling

Some PDFs are scanned images; many have a selectable text layer. Strategy:

Extract the text layer first (PyMuPDF / pdfplumber). Fast, cheap, and works for native PDFs.
Render the page to an image if there is no text layer or if layout matters, then send it to a vision LLM.

import fitz  # PyMuPDF

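doc = fitz.open(pdf_path)  # pdf_path: path to the input PDF
for page in doc:
    text = page.get_text()
    if text.strip():
        # Native PDF: the text layer is usable as-is.
        process_text_layer(text)
    else:
        # Scanned page: render at 200 DPI and hand the PNG bytes to the vision LLM.
        img = page.get_pixmap(dpi=200).tobytes("png")
        process_with_vision_llm(img)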

Multi-page documents

Three options for a 30-page contract:

Per-page: extract from each page, then merge results. Loses cross-page context (for example, tables that continue onto the next page).
Whole doc: send all pages at once. Eats input tokens and requires a long-context model. See 1M-Token Context Windows.
Chunked: send chunks of N pages with overlap, then merge.

For invoices, per-page works. For contracts, use whole-doc or chunked.
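The chunked option comes down to computing overlapping page windows. Here is a minimal sketch of that step; the function name `chunk_page_ranges` and its parameters are my own for illustration, and the right chunk size and overlap depend on your documents and model.

```python
def chunk_page_ranges(num_pages: int, chunk_size: int, overlap: int) -> list[range]:
    """Split pages 0..num_pages-1 into windows of chunk_size pages,
    where consecutive windows share `overlap` pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: list[range] = []
    start = 0
    while start < num_pages:
        end = min(start + chunk_size, num_pages)
        chunks.append(range(start, end))
        if end == num_pages:
            break  # last window reached the end of the document
        start += step
    return chunks


# A 30-page contract in 10-page chunks with 2 pages of overlap.
for pages in chunk_page_ranges(30, chunk_size=10, overlap=2):
    print(f"pages {pages.start}..{pages.stop - 1}")
```

The overlap is what lets a table that spans a page break appear whole in at least one chunk. Merging the per-chunk results (deduplicating rows seen twice) is a separate step not shown here.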

Cost discipline

For 1000 invoices/day at $0.02/page:

Single-page: $20/day = $600/month.
Multi-page (avg 3 pages): $60/day = $1,800/month.

If the workload is not real-time, batching (Anthropic batch API at 50% off) cuts that in half.
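The arithmetic above is easy to wrap in a helper so you can plug in your own volumes. This is a rough sketch: `monthly_cost` is a hypothetical name, and the 30-day month and flat per-page price are simplifying assumptions.

```python
def monthly_cost(
    docs_per_day: int,
    pages_per_doc: float,
    cost_per_page: float,
    days: int = 30,
    batch_discount: float = 0.0,
) -> float:
    """Estimated monthly spend in dollars, assuming a flat per-page price."""
    pages = docs_per_day * pages_per_doc * days
    return pages * cost_per_page * (1.0 - batch_discount)


single = monthly_cost(1000, 1, 0.02)                        # one page per invoice
multi = monthly_cost(1000, 3, 0.02)                         # three pages on average
batched = monthly_cost(1000, 3, 0.02, batch_discount=0.5)   # same, via a 50%-off batch API

print(f"single-page: ${single:,.0f}/month")
print(f"multi-page:  ${multi:,.0f}/month")
print(f"batched:     ${batched:,.0f}/month")
```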

For very high volume: fine-tune a smaller model on your invoice corpus. See Fine-Tuning vs RAG vs Prompting.

Validation and review

Vision LLMs are good but not perfect. Always:

Validate critical fields server-side (totals reconcile; dates parse; required fields present).
Reconciliation check: if the extracted total doesn’t match the sum of quantity × unit price across line items (plus tax), flag the document for review.
Route low-confidence extractions to a human review queue.
Keep an audit trail of every extraction.
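The totals check can be sketched as follows. To keep it self-contained it uses a plain dataclass rather than the Pydantic models from earlier; `Line`, `reconciliation_issues`, and the one-cent tolerance are assumptions for illustration, not a fixed rule.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Line:
    quantity: int
    unit_price: Decimal
    total: Decimal


def reconciliation_issues(
    lines: list[Line],
    subtotal: Decimal,
    tax: Optional[Decimal],
    total: Decimal,
    tolerance: Decimal = Decimal("0.01"),  # assumed rounding tolerance
) -> list[str]:
    """Return human-readable problems; an empty list means the numbers reconcile."""
    issues: list[str] = []
    for i, line in enumerate(lines):
        if abs(line.quantity * line.unit_price - line.total) > tolerance:
            issues.append(f"line {i}: quantity x unit_price != line total")
    line_sum = sum((line.total for line in lines), Decimal("0"))
    if abs(line_sum - subtotal) > tolerance:
        issues.append("line totals do not sum to subtotal")
    if abs(subtotal + (tax or Decimal("0")) - total) > tolerance:
        issues.append("subtotal + tax != total")
    return issues


lines = [Line(2, Decimal("10.00"), Decimal("20.00")), Line(1, Decimal("5.50"), Decimal("5.50"))]
issues = reconciliation_issues(lines, Decimal("25.50"), Decimal("2.55"), Decimal("28.05"))
print("auto-process" if not issues else f"send to review: {issues}")
```

Any non-empty result is a reason to route the document to the review queue rather than auto-process it.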

Common patterns

Invoice processing pipeline

Email / upload
  ↓
Extract attachments / decode
  ↓
Vision LLM extracts structured data
  ↓
Validate (totals, format)
  ↓
Auto-process, or route to review queue
  ↓
Update accounting system

Contract analysis

PDF
  ↓
Text + image extraction
  ↓
Long-context LLM (Opus 4.7 / Gemini 2.5 Pro)
  ↓
Multi-question extraction (parties, dates, key clauses, risks)
  ↓
Persist + show summary to user

Receipt scanning

Typical in mobile apps. The LLM call is heavyweight, so cache results by image hash in case users re-upload the same photo.
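A minimal sketch of hash-based caching, assuming an in-memory dict (in production you would more likely use Redis or a database table). `ExtractionCache` and the `extract` callable are hypothetical names standing in for your own extraction function.

```python
import hashlib
from typing import Callable


class ExtractionCache:
    """Cache extraction results keyed by the SHA-256 of the uploaded bytes."""

    def __init__(self, extract: Callable[[bytes], dict]):
        self._extract = extract          # the expensive vision-LLM call
        self._results: dict[str, dict] = {}

    def get(self, image_bytes: bytes) -> dict:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._results:
            self._results[key] = self._extract(image_bytes)
        return self._results[key]


# Stand-in extractor so the sketch runs without an API call.
calls = []
cache = ExtractionCache(lambda data: calls.append(1) or {"total": "12.40"})
cache.get(b"receipt-photo")
cache.get(b"receipt-photo")  # served from the cache, no second call
print(len(calls))
```

A byte-level hash only catches exact re-uploads. A photo that was retaken or re-compressed hashes differently and will trigger a fresh extraction.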

Common mistakes

1. No validation

The LLM extracts; you trust it. The total is wrong. You bill the customer the wrong amount.

2. Throwing big PDFs whole at the LLM

A 100-page PDF in context costs more than necessary if you only need one section. Chunk smartly.
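One way to chunk smartly, assuming the PDF has a usable text layer: search the per-page text for keywords and send only the matching pages plus a page of context on either side. `relevant_pages` and its parameters are illustrative, and plain keyword matching is an assumption; an embedding search would fill the same role.

```python
def relevant_pages(page_texts: list[str], keywords: list[str], context: int = 1) -> list[int]:
    """Return sorted page indices that mention any keyword, plus `context` neighbors."""
    lowered = [k.lower() for k in keywords]
    wanted: set[int] = set()
    for i, text in enumerate(page_texts):
        page = text.lower()
        if any(k in page for k in lowered):
            first = max(0, i - context)
            last = min(len(page_texts) - 1, i + context)
            wanted.update(range(first, last + 1))
    return sorted(wanted)


pages = ["Cover page", "Definitions", "Termination: either party may...", "Payment terms", "Signatures"]
print(relevant_pages(pages, ["termination"]))
```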

3. No fallback for failures

LLM times out / errors. Either retry, fall back to manual review, or fail the upload gracefully. Don’t show the user a stack trace.
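A sketch of retry-then-fallback, assuming exponential backoff and treating `None` as the signal to send the document to manual review. `extract_with_retry` is a hypothetical helper; in real code, catch your client's specific timeout and API error types instead of bare `Exception`.

```python
import time
from typing import Callable, Optional


def extract_with_retry(
    extract: Callable[[], dict],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call extract() up to `attempts` times; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return extract()
        except Exception:  # assumption: narrow this to timeout / API errors in practice
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None


def unreliable() -> dict:
    raise TimeoutError("simulated timeout")


result = extract_with_retry(unreliable, attempts=2, sleep=lambda s: None)
print("queued for manual review" if result is None else result)
```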

4. Assuming one shot is right

Run evals on your document corpus. Tune the prompt; iterate.
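A bare-bones eval, assuming you have hand-labeled a sample of documents as dicts of field to expected value. `field_accuracy` is an illustrative name, and exact-match scoring is an assumption; amounts and dates usually need normalizing before comparison.

```python
def field_accuracy(predictions: list[dict], labels: list[dict]) -> dict[str, float]:
    """Per-field exact-match accuracy over paired predictions and labels."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for pred, gold in zip(predictions, labels):
        for field, expected in gold.items():
            total[field] = total.get(field, 0) + 1
            if pred.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}


labels = [
    {"vendor_name": "Acme", "total": "10.00"},
    {"vendor_name": "Acme Inc", "total": "12.50"},
]
predictions = [
    {"vendor_name": "Acme", "total": "10.00"},
    {"vendor_name": "Acme Inc", "total": "12.00"},
]
print(field_accuracy(predictions, labels))
```

Tracking accuracy per field shows where the prompt needs work: a weak `total` calls for a different fix than a weak `vendor_name`.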

5. Logging documents with PII

Receipts and contracts have PII. Redact before logging, or keep logs in a private, access-controlled observability stack.
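One conservative way to redact is to log only metadata about the extraction and never the values themselves. A sketch under that assumption; `log_safe_record` and the fields it emits are hypothetical, and what counts as safe to log depends on your compliance requirements.

```python
import hashlib
import json


def log_safe_record(extraction: dict, document_bytes: bytes) -> str:
    """Build a JSON log line with a document hash and field names, but no field values."""
    return json.dumps({
        "doc_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "fields_present": sorted(k for k, v in extraction.items() if v is not None),
        "fields_missing": sorted(k for k, v in extraction.items() if v is None),
    })


extraction = {"vendor_name": "Acme Ltd", "customer_name": "Jane Doe", "tax": None}
print(log_safe_record(extraction, b"%PDF-1.7 ..."))
```

The hash lets you correlate a log line with a stored document without putting names or amounts in the log.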

Read this next

If you want my invoice / receipt extraction pipeline, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.
