When is vision actually useful?

OCR replacement, diagram understanding, chart reading, screenshot debugging, and document layout. Skip it for simple cases where text-based structured extraction would work; those are cheaper without vision.

Are vision tokens expensive?

Yes — a single image consumes 1k–3k tokens depending on resolution. Resize / crop before sending; use vision only when text alternatives would lose information.

Multimodal LLMs in 2026 — Vision, Audio, and What's Actually Useful

Multimodal LLMs went from research toy to production tool. Vision in particular is widely useful by 2026. This post is the working set for building with it.

What works well in 2026

Document understanding: PDFs, screenshots, mixed-content pages.
Chart / diagram reading: bar charts, flow diagrams, screenshots of dashboards.
OCR + reasoning: receipts, forms, invoices.
Visual debugging: “Why does this UI look broken?”
Audio transcription + understanding: meeting summaries, voice agents.

What still struggles:

Long video with detailed reasoning.
Precise pixel-level localization (use CV models for that).
Real-time streaming inference at low cost (still expensive).

Vision input — the API

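import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Images are sent as base64 text inside the message body
with open("invoice.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract line items as JSON."},
        ],
    }],
)

print(resp.content[0].text)  # the model's text reply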

For URLs (some providers): {"type": "image", "source": {"type": "url", "url": "..."}}.
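A small helper can hide the difference between the two source types. This is a minimal sketch: `image_block` is a hypothetical name, the dict shapes simply mirror the two forms shown above, and the URL form only helps with providers that accept it.

```python
import base64
import mimetypes
from pathlib import Path


def image_block(src: str) -> dict:
    """Build an image content block from a URL or a local file path.

    Hypothetical helper, not part of any SDK.
    """
    if src.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": src}}

    # Guess the media type from the file extension before reading the file
    media_type, _ = mimetypes.guess_type(src)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"not a recognised image file: {src}")

    data = base64.b64encode(Path(src).read_bytes()).decode()
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dict can be dropped into the `content` list in place of the hand-written image entry.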

Resize before sending

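import io

from PIL import Image

img = Image.open(path)  # path: the image file to send
img.thumbnail((1568, 1568))  # cap the longest side in place; aspect ratio is kept
buffer = io.BytesIO()
img.convert("RGB").save(buffer, format="JPEG", quality=85)  # JPEG has no alpha channel
jpeg_bytes = buffer.getvalue()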

Bigger image ≠ better answer past a point. Vision tokens scale with size; cap at ~1500×1500 unless you need fine detail.
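To see roughly what a resize saves, here is a back-of-the-envelope sketch. It assumes the rule of thumb tokens ≈ width × height / 750, which Anthropic's documentation gives as an approximation for its models; other providers count differently, and some downscale large images on their side, so treat the numbers as ballpark. The function names and the 3264×2448 photo size are illustrative assumptions.

```python
import math


def fit_within(width: int, height: int, max_side: int = 1568) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    return (
        max(1, round(width * max_side / longest)),
        max(1, round(height * max_side / longest)),
    )


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough estimate, ASSUMING tokens ~= width * height / 750 (provider-specific)."""
    return math.ceil(width * height / 750)


raw = (3264, 2448)        # an 8 MP phone photo (illustrative size)
small = fit_within(*raw)  # same aspect ratio, longest side capped
print(raw, estimate_image_tokens(*raw))
print(small, estimate_image_tokens(*small))
```

Under that assumption the raw photo lands above ten thousand tokens and the capped version at roughly a quarter of that.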

Document understanding pattern

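def extract_invoice(pdf_path):
    # pdf_to_images, vision_call, and parse are placeholders for your own helpers
    pages = pdf_to_images(pdf_path)  # one image per page
    items = []
    for img in pages:
        resp = vision_call(img, "Extract line items: description, qty, unit_price, total")
        items.extend(parse(resp))  # parse: model response -> list of line-item dicts
    return items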

For multi-page PDFs: send each page; aggregate. Or send all pages in one call (within token limits).

For high accuracy: combine vision LLM with traditional OCR (Tesseract / PaddleOCR), feed both to the model.
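One way to feed both to the model is to put the page image and the OCR text in the same user message. A minimal sketch, assuming the OCR text has already been produced by whatever engine you use; `vision_plus_ocr_content`, the `<ocr>` wrapper, and the instruction wording are illustrative choices, not a provider convention.

```python
def vision_plus_ocr_content(
    img_b64: str,
    ocr_text: str,
    task: str,
    media_type: str = "image/png",
) -> list[dict]:
    """Build a user-message content list holding the image, the OCR text, and the task."""
    prompt = (
        f"{task}\n\n"
        "OCR output for the same page is below. It may contain errors; "
        "prefer the image where the two disagree.\n"
        f"<ocr>\n{ocr_text}\n</ocr>"
    )
    return [
        {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": img_b64}},
        {"type": "text", "text": prompt},
    ]
```

The result goes in as the `content` of a single user message. The OCR text gives the model exact character strings to copy, while the image supplies the layout that OCR flattens.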

Audio

# OpenAI SDK (client = openai.OpenAI()): transcribe with Whisper
with open("meeting.mp3", "rb") as audio_file:
    resp = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
transcript = resp.text

# Then reason about it (llm is a placeholder for your async text-LLM client)
summary = await llm.complete(f"Summarize this meeting: {transcript}")

Two-stage: transcribe → reason. Cheaper and more controllable than a single model that processes audio directly.

For real-time voice agents: streaming STT (Deepgram, AssemblyAI) → LLM → streaming TTS (ElevenLabs, OpenAI TTS). Latency is still the hard part.

Generation (image-from-text)

# DALL-E / Imagen / Stable Diffusion
resp = openai.images.generate(
    model="dall-e-3",
    prompt="A minimalist diagram of a 3-tier web architecture, isometric, clean lines",
    size="1024x1024",
)
url = resp.data[0].url

Useful for:

Marketing imagery.
UI mockups.
Diagram drafts.

Less useful for: anything brand-consistent (use traditional design); anything legally sensitive (provenance / IP).

Vision RAG

PDF → page-images → embed each via vision-aware embedding → store
Query → embed → retrieve top-k page images → send to vision LLM

Bypasses OCR entirely. Tools like ColPali do this natively. Best for diagrams, formulas, complex layouts where OCR loses information.
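The retrieval step in that pipeline can be sketched with plain cosine similarity. This assumes each page has already been turned into a single vector by some vision-aware embedding model; ColPali itself uses multi-vector late-interaction scoring rather than one vector per page, so this is the simplified version of the idea. `cosine` and `top_k_pages` are hypothetical names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_pages(query_vec: list[float], page_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k pages whose embeddings are closest to the query."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d vectors standing in for real page embeddings
index = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
print(top_k_pages([1.0, 0.1], index, k=2))
```

The page images behind the returned ids are what get sent to the vision LLM along with the question.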

Cost reality

Vision is significantly more expensive than text:

A 1024×1024 image ≈ 1.6k input tokens.
Multi-page PDF: 5 pages × 1.6k = 8k input tokens per call.
At Sonnet rates: ~$0.024 per call.
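The arithmetic above, written out so you can plug in your own numbers. The $3 per million input tokens is an assumption inferred from the figures above (8k tokens ≈ $0.024), so check current pricing; the function name and the monthly volume are hypothetical, and output tokens are not counted.

```python
def vision_call_cost(
    pages: int,
    tokens_per_page: int = 1_600,
    usd_per_million_input: float = 3.0,  # ASSUMED rate; check your provider's pricing
) -> float:
    """Input-side cost of one call, in USD. Output tokens are not included."""
    return pages * tokens_per_page * usd_per_million_input / 1_000_000


per_call = vision_call_cost(pages=5)  # 5 pages x 1.6k = 8k input tokens
per_100k = per_call * 100_000         # hypothetical volume
print(f"${per_call:.3f} per call, ${per_100k:,.0f} per 100k calls")
```

Pennies per call look harmless until they are multiplied by volume, which is why routing matters for high-throughput pipelines.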

For high-volume document processing: route easy ones to OCR + small text LLM; vision LLM only when text-only fails. See LLM Cost Optimization.
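That routing rule can be sketched as a thin wrapper. Everything here is hypothetical scaffolding: `cheap_path` stands for your OCR + small text LLM pipeline, `vision_path` for the vision call, and `is_valid` for whatever check decides that the text-only result is good enough.

```python
from typing import Callable, Optional


def extract_with_fallback(
    doc: bytes,
    cheap_path: Callable[[bytes], Optional[dict]],
    vision_path: Callable[[bytes], dict],
    is_valid: Callable[[dict], bool],
) -> tuple[dict, str]:
    """Try OCR + text LLM first; use the vision LLM only if that result is missing or invalid."""
    try:
        result = cheap_path(doc)
    except Exception:
        result = None  # treat a crash in the cheap path as "text-only failed"
    if result is not None and is_valid(result):
        return result, "text"
    return vision_path(doc), "vision"
```

Returning the route alongside the result makes it easy to log what share of traffic still needs vision.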

Pitfalls

Hallucinated text: model “reads” text not actually in the image. Validate against OCR or human review for critical docs.
Inconsistent layouts: tables across pages confuse models. Pre-process if you can.
Privacy: images may contain PII / faces / sensitive data. Treat with same care as text PII.
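For the hallucinated-text pitfall, a cheap first check is to confirm that each extracted value literally appears in OCR output for the same page. A minimal sketch with hypothetical names; the normalization keeps ASCII letters and digits only, which is an assumption that suits invoice numbers and amounts but not every script.

```python
import re


def _norm(s: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


def unsupported_fields(extracted: dict[str, str], ocr_text: str) -> list[str]:
    """Return field names whose values do not appear anywhere in the OCR text."""
    haystack = _norm(ocr_text)
    return [
        name
        for name, value in extracted.items()
        if _norm(str(value)) not in haystack
    ]


ocr = "INVOICE #A-1042\nTotal: $1,250.00"
fields = {"invoice_no": "A-1042", "total": "1250.00", "po_number": "PO-7781"}
print(unsupported_fields(fields, ocr))  # fields the OCR text does not back up
```

OCR makes its own mistakes, so a flagged field is a cue for review rather than proof that the model invented it.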

Practical use cases I’ve shipped

Receipt scanning: phone photo → JSON line items. ~95% accuracy with vision LLM, much higher than OCR alone.
Dashboard alerts: Grafana screenshot → “what changed?” Surprisingly good for triage.
Form extraction: structured fields out of arbitrary PDFs.
Visual QA: “Did the homepage layout change?” against a baseline screenshot.

When NOT to use vision

Structured data already exists (CSV, JSON). Don’t render to image and re-extract.
Strict accuracy requirements (medical, legal). Use a specialized OCR + verification pipeline.
Cost-sensitive high volume. Text-only is 10× cheaper.

Common mistakes

1. Sending raw 8MP photos

Ten thousand tokens for an image when 1,500 would suffice.

2. No fallback when vision fails

Vision returns nonsense; pipeline blows up. Validate output; fall back to OCR; alert on confidence drops.
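Validation does not need to be elaborate. For invoice line items, checking that the numbers parse and that qty × unit_price matches total already catches a lot of nonsense. A sketch, assuming the extraction returns dicts with the `qty`, `unit_price`, and `total` fields used in the extraction prompt earlier; the function name and the tolerance are illustrative.

```python
def validate_line_items(items: list[dict], tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means the extraction passed."""
    problems = []
    if not items:
        problems.append("no line items extracted")
    for i, item in enumerate(items):
        try:
            qty = float(item["qty"])
            unit_price = float(item["unit_price"])
            total = float(item["total"])
        except (KeyError, TypeError, ValueError) as exc:
            problems.append(f"item {i}: missing or non-numeric field ({exc!r})")
            continue
        # Arithmetic check: the row should be internally consistent
        if abs(qty * unit_price - total) > tolerance:
            problems.append(f"item {i}: {qty} x {unit_price} != {total}")
    return problems
```

A non-empty result is the trigger for the OCR fallback, and the rate of non-empty results over time is a reasonable signal to alert on.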

3. Trusting extracted text without verification

For high-stakes data: human-in-the-loop on a sample.
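Sampling can be made deterministic by hashing the document id, so the same document always gets the same decision across reruns. A minimal sketch; the function name and the 5% default rate are assumptions, not recommendations.

```python
import hashlib


def needs_human_review(doc_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically pick roughly sample_rate of documents for manual review."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number between 0 and 1
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the id, reviewers and the pipeline agree on which documents are in the sample without sharing any state.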

4. One-shot for complex docs

Big PDFs: split into pages, process individually, then aggregate.

5. Mixing modalities unnecessarily

Question is text-only? Don’t include the image. Saves tokens.

What I’d ship today

For a doc-understanding feature:

Resize / convert to ~1500px JPEG.
Vision LLM with structured-output schema for extraction.
OCR cross-check on critical fields.
Confidence threshold → human review if low.
Cache by content hash if reprocessing.
Cost monitoring per feature.
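The caching step from the list above can be as small as this. It is a sketch with an in-memory dict; a real deployment would more likely use Redis or a database table, and `ExtractionCache` and `compute` are hypothetical names. The prompt is part of the key so that changing the instructions does not return stale results.

```python
import hashlib
from typing import Callable


class ExtractionCache:
    """In-memory cache keyed by the SHA-256 of the image bytes plus the prompt."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    @staticmethod
    def key(image_bytes: bytes, prompt: str) -> str:
        h = hashlib.sha256()
        h.update(image_bytes)
        h.update(b"\x00")  # separator between the image bytes and the prompt
        h.update(prompt.encode("utf-8"))
        return h.hexdigest()

    def get_or_compute(
        self,
        image_bytes: bytes,
        prompt: str,
        compute: Callable[[bytes, str], dict],
    ) -> dict:
        k = self.key(image_bytes, prompt)
        if k not in self._store:
            self._store[k] = compute(image_bytes, prompt)  # only runs on a cache miss
        return self._store[k]
```

Hash the bytes you actually send (after resizing), otherwise two uploads of the same photo at different resolutions count as different documents.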

Read this next

If you want my multimodal extraction starter (vision + OCR + verification), it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
