Multimodal LLMs went from research toy to production tool. Vision in particular is widely useful by 2026. This post is the working set for building with it.
What works well in 2026
- Document understanding: PDFs, screenshots, mixed-content pages.
- Chart / diagram reading: bar charts, flow diagrams, screenshots of dashboards.
- OCR + reasoning: receipts, forms, invoices.
- Visual debugging: “Why does this UI look broken?”
- Audio transcription + understanding: meeting summaries, voice agents.
What still struggles:
- Long video with detailed reasoning.
- Precise pixel-level localization (use CV models for that).
- Real-time streaming inference at low cost (still expensive).
Vision input — the API
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Read the image and base64-encode it for the request body
with open("invoice.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Extract line items as JSON."},
        ],
    }],
)
For URLs (some providers): {"type": "image", "source": {"type": "url", "url": "..."}}.
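If you send more than PNGs, the `media_type` has to match the file. Here is a minimal sketch of a helper that builds the same base64 image block as the request above. The name `image_block` and the `SUPPORTED_MEDIA_TYPES` set are my own; treat the set as an assumption and check your provider's list.

```python
import base64
import mimetypes

# Assumed set of accepted media types; verify against your provider's documentation.
SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def image_block(data: bytes, filename: str) -> dict:
    """Build a base64 image content block, guessing the media type from the filename."""
    media_type, _ = mimetypes.guess_type(filename)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported or unknown image type for {filename!r}: {media_type}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```

Guessing from the filename is cheap but trusts the extension. If uploads come from users, sniffing the file header is the safer choice.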
Resize before sending
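import io
from PIL import Image

img = Image.open(path)  # path: the image file to shrink
img.thumbnail((1568, 1568))  # cap dimensions in place; aspect ratio is preserved
buffer = io.BytesIO()
img.convert("RGB").save(buffer, format="JPEG", quality=85)  # JPEG has no alpha channel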
Bigger image ≠ better answer past a point. Vision tokens scale with size; cap at ~1500×1500 unless you need fine detail.
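To see what a cap does to a given photo before you touch any pixels, the arithmetic can be sketched on its own. `fit_within` is a hypothetical helper, not part of Pillow, and Pillow's own rounding may differ by a pixel.

```python
def fit_within(width: int, height: int, cap: int = 1568) -> tuple[int, int]:
    """Return (width, height) scaled down to fit a cap x cap box, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, cap / width, cap / height)  # never upscale small images
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For a 3264×2448 phone photo this gives 1568×1176, which is about 4.3× fewer pixels to pay for. An 800×600 image passes through unchanged.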
Document understanding pattern
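# pdf_to_images, vision_call, and parse are your own helpers, not library functions
def extract_invoice(pdf_path):
    pages = pdf_to_images(pdf_path)  # one image per page
    items = []
    for img in pages:
        resp = vision_call(img, "Extract line items: description, qty, unit_price, total")
        items.extend(parse(resp))  # turn the model's response into line-item dicts
    return items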
For multi-page PDFs: send each page; aggregate. Or send all pages in one call (within token limits).
For high accuracy: combine vision LLM with traditional OCR (Tesseract / PaddleOCR), feed both to the model.
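One way to feed both to the model is to interleave each page image with its OCR text in a single message. This is a sketch, assuming pages are already rendered to PNG bytes and the OCR text was produced separately (for example by Tesseract). The function name and the wording of the OCR label are mine.

```python
import base64


def build_ocr_assisted_content(
    page_images: list[bytes], ocr_texts: list[str], instruction: str
) -> list[dict]:
    """Interleave each page image with its OCR text, then append the instruction."""
    if len(page_images) != len(ocr_texts):
        raise ValueError("need exactly one OCR text per page image")
    content: list[dict] = []
    for number, (png_bytes, ocr_text) in enumerate(zip(page_images, ocr_texts), start=1):
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(png_bytes).decode("ascii"),
            },
        })
        # Tell the model the OCR text is a hint, not ground truth.
        content.append({
            "type": "text",
            "text": f"OCR text for page {number} (may contain errors):\n{ocr_text}",
        })
    content.append({"type": "text", "text": instruction})
    return content
```

The returned list drops straight into the `content` field of a user message. Labeling the OCR text as fallible matters: the point is for the model to reconcile the two sources, not to copy either one blindly.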
Audio
# OpenAI transcription endpoint (Whisper); `client` is an openai.OpenAI() instance
with open("meeting.mp3", "rb") as audio:
    resp = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
    )
transcript = resp.text

# Then reason about it (`llm` stands in for your async text-LLM client)
summary = await llm.complete(f"Summarize this meeting: {transcript}")
Two-stage: transcribe → reason. Cheaper and more controllable than a single model that processes audio directly.
For real-time voice agents: streaming STT (Deepgram, AssemblyAI) → LLM → streaming TTS (ElevenLabs, OpenAI TTS). Latency is still the hard part.
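The shape of that pipeline can be sketched with stand-ins. Everything here is hypothetical: `fake_stt`, `fake_llm`, and `fake_tts` are stubs that mimic streaming clients, not any vendor's SDK, and the "audio" is just strings and bytes.

```python
import asyncio
from typing import AsyncIterator


async def fake_stt(audio_chunks: list[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming STT client: yields one transcript fragment per chunk."""
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # a real client would wait on the network here
        yield chunk


async def fake_llm(utterance: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields the reply word by word."""
    for word in f"You said: {utterance}".split():
        await asyncio.sleep(0)
        yield word


async def fake_tts(text: str) -> bytes:
    """Stand-in for TTS: returns 'audio' bytes for one text fragment."""
    await asyncio.sleep(0)
    return text.encode("utf-8")


async def voice_turn(audio_chunks: list[str]) -> list[bytes]:
    """Run one turn: collect the utterance, then synthesize the reply as it streams."""
    fragments = [fragment async for fragment in fake_stt(audio_chunks)]
    utterance = " ".join(fragments)
    audio_out = []
    async for word in fake_llm(utterance):
        # Synthesize each fragment as it arrives instead of waiting for the
        # full reply; this is one common way to cut perceived latency.
        audio_out.append(await fake_tts(word))
    return audio_out
```

Run it with `asyncio.run(voice_turn(["hello", "world"]))`. In a real agent each stage would be a network stream, and you would also need interruption handling, which this sketch leaves out.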
Generation (image-from-text)
# DALL-E shown here; Imagen and Stable Diffusion have their own APIs
import openai

resp = openai.images.generate(
    model="dall-e-3",
    prompt="A minimalist diagram of a 3-tier web architecture, isometric, clean lines",
    size="1024x1024",
)
url = resp.data[0].url  # URL of the generated image
Useful for:
- Marketing imagery.
- UI mockups.
- Diagram drafts.
Less useful for: anything brand-consistent (use traditional design); anything legally sensitive (provenance / IP).
Vision RAG
PDF → page-images → embed each via vision-aware embedding → store
Query → embed → retrieve top-k page images → send to vision LLM
Bypasses OCR entirely. Tools like ColPali do this natively. Best for diagrams, formulas, complex layouts where OCR loses information.
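The retrieval step in ColPali-style systems uses late-interaction scoring: every page is stored as many vectors (one per image patch), the query as one vector per token, and each query vector is matched to its best page vector. A toy sketch of that scoring in plain Python follows. The function names and the tiny 2-D vectors are illustrative assumptions, not ColPali's API.

```python
def maxsim_score(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query vector take the max dot product
    over all page vectors, then sum those maxima."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def top_k_pages(
    query_vecs: list[list[float]], pages: dict[str, list[list[float]]], k: int = 3
) -> list[str]:
    """Rank page ids by MaxSim score, highest first, and keep the top k."""
    ranked = sorted(
        pages,
        key=lambda page_id: maxsim_score(query_vecs, pages[page_id]),
        reverse=True,
    )
    return ranked[:k]
```

The page ids returned by `top_k_pages` are what you would look up to fetch the page images for the vision LLM. A real index would use a vector store and batched tensor math rather than nested Python loops.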
Cost reality
Vision is significantly more expensive than text:
- A 1024×1024 image ≈ 1.6k input tokens.
- Multi-page PDF: 5 pages × 1.6k = 8k input tokens per call.
- At Sonnet rates (about $3 per million input tokens, the rate these numbers imply): ~$0.024 per call, before output tokens.
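The arithmetic in that list, as a function you can plug your own numbers into. The defaults are the figures from this post (1.6k tokens per page, and the $3 per million input tokens they imply); both are assumptions to replace with your provider's current numbers.

```python
def vision_input_cost(
    pages: int, tokens_per_page: int = 1600, usd_per_million_tokens: float = 3.0
) -> float:
    """Input-side cost in USD for one call that sends `pages` page images."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000
```

`vision_input_cost(5)` reproduces the ~$0.024 above. Multiply by volume to see why it adds up: 100,000 five-page documents come to roughly $2,400 in input tokens alone.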
For high-volume document processing: route easy ones to OCR + small text LLM; vision LLM only when text-only fails. See LLM Cost Optimization.
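That routing can be sketched as a thin wrapper around two extractors. `cheap_extract` and `vision_extract` are hypothetical callables you supply, and "text-only fails" is defined here as a missing or empty required field. That is only one possible definition of failure.

```python
from typing import Callable, Optional


def extract_with_routing(
    document: bytes,
    cheap_extract: Callable[[bytes], Optional[dict]],
    vision_extract: Callable[[bytes], dict],
    required_fields: tuple[str, ...] = ("total",),
) -> tuple[dict, str]:
    """Try the cheap OCR + text path first; fall back to vision if it looks incomplete."""
    result = cheap_extract(document)
    if result is not None and all(
        result.get(field) not in (None, "") for field in required_fields
    ):
        return result, "ocr+text"
    # The cheap path returned nothing or left a required field empty.
    return vision_extract(document), "vision"
```

Returning the route name alongside the result makes it easy to log what fraction of traffic ends up on the expensive path.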
Pitfalls
- Hallucinated text: model “reads” text not actually in the image. Validate against OCR or human review for critical docs.
- Inconsistent layouts: tables across pages confuse models. Pre-process if you can.
- Privacy: images may contain PII / faces / sensitive data. Treat with same care as text PII.
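For the hallucinated-text pitfall, a cheap first check is whether each extracted value shows up anywhere in the OCR text. A minimal sketch, assuming you already have OCR output for the same page. The helper names are mine, and the normalization is deliberately crude.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, so spacing and punctuation don't matter."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def unsupported_fields(extracted: dict[str, str], ocr_text: str) -> list[str]:
    """Return the field names whose values do not appear anywhere in the OCR text.

    Empty values normalize to an empty string and are treated as supported.
    """
    haystack = _normalize(ocr_text)
    return [
        name
        for name, value in extracted.items()
        if _normalize(str(value)) not in haystack
    ]
```

OCR makes its own mistakes, so a flagged field is a reason to send the document for review, not proof that the model invented it.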
Practical use cases I’ve shipped
- Receipt scanning: phone photo → JSON line items. ~95% accuracy with vision LLM, much higher than OCR alone.
- Dashboard alerts: Grafana screenshot → “what changed?” Surprisingly good for triage.
- Form extraction: structured fields out of arbitrary PDFs.
- Visual QA: “Did the homepage layout change?” against a baseline screenshot.
When NOT to use vision
- Structured data already exists (CSV, JSON). Don’t render to image and re-extract.
- Strict accuracy requirements (medical, legal). Use a specialized OCR + verification pipeline.
- Cost-sensitive high volume. Text-only is roughly 10× cheaper.
Common mistakes
1. Sending raw 8MP photos
Ten thousand tokens for an image when ~1,500 tokens would suffice.
2. No fallback when vision fails
Vision returns nonsense; pipeline blows up. Validate output; fall back to OCR; alert on confidence drops.
3. Trusting extracted text without verification
For high-stakes data: human-in-the-loop on a sample.
4. One-shot for complex docs
Big PDFs: split into pages, process individually, then aggregate.
5. Mixing modalities unnecessarily
Question is text-only? Don’t include the image. Saves tokens.
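For mistakes 2 and 3, a validator between the model and the rest of the pipeline catches a lot. This sketch assumes the model was asked for a JSON array of line items with the `qty`, `unit_price`, and `total` fields used in the extraction prompt earlier. The function name and the one-cent tolerance are my choices.

```python
import json
from decimal import Decimal, InvalidOperation


def validate_line_items(raw_json: str, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of problems found in the model's JSON output (empty list = passes)."""
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return ["expected a JSON array of line items"]
    problems = []
    for index, item in enumerate(items):
        try:
            qty = Decimal(str(item["qty"]))
            unit_price = Decimal(str(item["unit_price"]))
            total = Decimal(str(item["total"]))
            mismatch = abs(qty * unit_price - total) > tolerance
        except (KeyError, TypeError, InvalidOperation) as exc:
            problems.append(f"item {index}: missing or non-numeric field ({exc!r})")
            continue
        if mismatch:
            problems.append(
                f"item {index}: qty * unit_price = {qty * unit_price}, but total = {total}"
            )
    return problems
```

An empty list means the output is internally consistent, not that it is correct. A non-empty list is the signal to fall back to OCR or queue the document for human review.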
What I’d ship today
For a doc-understanding feature:
- Resize / convert to ~1500px JPEG.
- Vision LLM with structured-output schema for extraction.
- OCR cross-check on critical fields.
- Confidence threshold → human review if low.
- Cache by content hash if reprocessing.
- Cost monitoring per feature.
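The content-hash cache is the easiest item on that list to get wrong, usually by keying on the filename instead of the bytes. A minimal in-memory sketch follows. `ContentHashCache` is a hypothetical class, and a production version would sit on Redis or a database rather than a dict.

```python
import hashlib
from typing import Callable


class ContentHashCache:
    """In-memory cache keyed by SHA-256 of the document bytes plus the prompt."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(document: bytes, prompt: str) -> str:
        digest = hashlib.sha256()
        digest.update(document)
        digest.update(b"\x00")  # separator between document bytes and prompt
        digest.update(prompt.encode("utf-8"))
        return digest.hexdigest()

    def get_or_compute(
        self, document: bytes, prompt: str, compute: Callable[[bytes, str], dict]
    ) -> dict:
        """Return the cached result, or call `compute` once and remember its result."""
        cache_key = self.key(document, prompt)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        self.misses += 1
        result = compute(document, prompt)
        self._store[cache_key] = result
        return result
```

Including the prompt in the key matters: the same PDF with a different extraction prompt is a different job. If you change models, add the model name to the key as well.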
Read this next
If you want my multimodal extraction starter (vision + OCR + verification), it’s at rajpoot.dev.