Multimodal LLMs cheatsheet.

OpenAI vision

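from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/img.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)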

Base64:

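import base64
with open("img.jpg", "rb") as f:  # context manager closes the file
    img_b64 = base64.b64encode(f.read()).decode()
# The data URL goes where the https URL was in the image_url part:
image_part = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}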

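The media type in the data URL should match the actual file. A minimal sketch of a helper that guesses it from the file extension with the standard library; to_data_url is a hypothetical name, not part of any SDK:

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Return a data URL for an image file, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The result can be used as the "url" value of an image_url part. Guessing from the extension is a convenience only; it does not inspect the file contents.
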
Anthropic vision

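import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # img_b64 is the base64 string from the Base64 snippet above
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": img_b64}},
            {"type": "text", "text": "Describe this image"},
        ],
    }],
)
print(response.content[0].text)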

Use cases

  • OCR / text extraction.
  • Image description / captioning.
  • Diagram understanding.
  • UI element detection.
  • Form parsing.
  • Document layout analysis.

Costs

Image input is billed in tokens, and the token count grows with image dimensions. Resize so the longest side is about 1000-1500 px for typical use.

from PIL import Image

img = Image.open("big.jpg")
img.thumbnail((1500, 1500))        # resizes in place, keeps aspect ratio, never upscales
img.save("small.jpg", quality=85)
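
To see what a resize buys you before sending anything, the arithmetic can be sketched in plain Python. This assumes the rule of thumb from Anthropic's vision documentation (roughly width * height / 750 tokens per image for Claude); other providers use different formulas, so treat the number as an estimate only. Both function names are hypothetical.

```python
def fit_within(width: int, height: int, max_side: int = 1500) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side; never upscale."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def approx_claude_image_tokens(width: int, height: int) -> int:
    """Rule-of-thumb estimate (width * height / 750); not a billing guarantee."""
    return round(width * height / 750)


new_w, new_h = fit_within(4000, 3000)
print(new_w, new_h, approx_claude_image_tokens(new_w, new_h))  # 1500 1125 2250
```
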

Audio (Whisper)

with open("audio.mp3", "rb") as audio:  # context manager closes the file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        language="en",  # optional ISO-639-1 hint
    )
print(transcript.text)

Whisper is also available as open weights; faster-whisper and WhisperX are common ways to run it locally.

Audio in chat models

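Audio-capable OpenAI and Gemini models accept audio input directly. The content parts below use OpenAI's Chat Completions format; b64_audio is the base64-encoded audio file: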

"content": [
    {"type": "input_audio", "input_audio": {"data": b64_audio, "format": "wav"}},
    {"type": "text", "text": "Transcribe and summarize"},
],
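
The fragment above can be wrapped into a complete user message. A minimal sketch, assuming the OpenAI Chat Completions shape shown above; audio_message is a hypothetical helper name, and the wav/mp3 restriction is an assumption of this sketch:

```python
import base64


def audio_message(audio_bytes: bytes, prompt: str, fmt: str = "wav") -> dict:
    """Build one user message holding an input_audio part followed by a text part."""
    if fmt not in ("wav", "mp3"):
        raise ValueError("this sketch assumes wav or mp3 input")
    b64_audio = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": b64_audio, "format": fmt}},
            {"type": "text", "text": prompt},
        ],
    }
```

Pass the result as messages=[audio_message(...)] to client.chat.completions.create with an audio-capable model.
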

TTS

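# Stream the audio straight to disk instead of buffering the whole response
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello",
) as response:
    response.stream_to_file("out.mp3")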

Voices: alloy, echo, fable, onyx, nova, shimmer.

Video

Most LLM pipelines handle video by sampling frames and sending them as images alongside the text prompt. Gemini can process video files natively.

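# Gemini
import google.generativeai as genai

with open("video.mp4", "rb") as f:  # hypothetical file; inline bytes suit short clips only
    video_bytes = f.read()

# Assumed model ID; use whichever Gemini 2.5 variant you have access to
model = genai.GenerativeModel("gemini-2.5-flash")
result = model.generate_content([
    "Describe what happens in this video",
    {"mime_type": "video/mp4", "data": video_bytes},
])
print(result.text)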

For other LLMs, extract a frame every N seconds and pass the frames as images.
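
A minimal sketch of that frame sampling, assuming OpenCV (the opencv-python package) is installed. Both function names are hypothetical, and the 2-second default is an arbitrary choice:

```python
def sample_frame_indices(total_frames: int, fps: float, every_seconds: float) -> list[int]:
    """Indices of the frames to keep when sampling one frame every `every_seconds`."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


def extract_frames(path: str, every_seconds: float = 2.0) -> list[bytes]:
    """Return JPEG-encoded bytes for the sampled frames of a video file."""
    import cv2  # imported lazily so the index helper works without OpenCV

    cap = cv2.VideoCapture(path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for index in sample_frame_indices(total, fps, every_seconds):
            cap.set(cv2.CAP_PROP_POS_FRAMES, index)  # seek to the sampled frame
            ok, frame = cap.read()
            if not ok:
                break  # stop on the first frame that cannot be decoded
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(buffer.tobytes())
        return frames
    finally:
        cap.release()
```

Each returned item can be base64-encoded and sent as an image part, as in the vision examples earlier. Keep the frame count small, since every frame is billed as a separate image.
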

Image generation

# DALL-E
response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain landscape",
    size="1024x1024",
    quality="hd",
    n=1,
)

# Stable Diffusion XL (local)
from diffusers import StableDiffusionXLPipeline  # SDXL checkpoints need the XL pipeline class
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
image = pipe("a cat").images[0]
image.save("cat.png")

Other options for high-quality generation: Midjourney, Flux, and models hosted on Replicate.

Image editing

with open("image.png", "rb") as image, open("mask.png", "rb") as mask:
    response = client.images.edit(
        image=image,
        mask=mask,  # fully transparent pixels mark the region to repaint
        prompt="Add a sunset",
        n=1,
        size="1024x1024",
    )

OCR

Multimodal LLMs do OCR well, but for high-volume workloads a dedicated OCR engine (Tesseract, EasyOCR, PaddleOCR) is cheaper.

import easyocr
reader = easyocr.Reader(['en'])
results = reader.readtext('image.jpg')
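
By default, readtext returns a list of (bounding box, text, confidence) tuples, so low-confidence detections can be dropped before the text is used downstream. A minimal sketch; confident_text is a hypothetical helper, the 0.5 threshold is an arbitrary assumption, and the sample data is hand-written for illustration:

```python
def confident_text(results, min_confidence: float = 0.5) -> list[str]:
    """Keep only the text of detections whose confidence meets the threshold."""
    return [text for _bbox, text, confidence in results if confidence >= min_confidence]


# Hand-written sample in the same shape as readtext output (illustrative values only)
sample = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "Invoice", 0.98),
    ([[0, 30], [80, 30], [80, 50], [0, 50]], "T0tal", 0.31),
]
print(confident_text(sample))  # ['Invoice']
```
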

Document understanding

Render each PDF page to an image:

import fitz                          # PyMuPDF

doc = fitz.open("doc.pdf")
images = [page.get_pixmap(dpi=150).tobytes("png") for page in doc]
# Pass each as image to vision LLM
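
One way to pass the rendered pages along is to pack them into a single message. A minimal sketch, assuming the Anthropic image-block format from the vision section above; pages_to_content is a hypothetical helper, and labeling each page with a short text block is a design choice of this sketch, not an API requirement:

```python
import base64


def pages_to_content(page_pngs: list[bytes], question: str) -> list[dict]:
    """Turn PNG page bytes into labeled image blocks, followed by the question."""
    content = []
    for number, png in enumerate(page_pngs, start=1):
        content.append({"type": "text", "text": f"Page {number}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(png).decode("ascii"),
            },
        })
    content.append({"type": "text", "text": question})
    return content
```

Use the result as the "content" of a user message, for example pages_to_content(images, "Summarize this document"). For long documents, send pages in batches to stay within request size limits.
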

Or use a specialized service: AWS Textract, Azure Document Intelligence, Google Document AI.

Embeddings for images

CLIP, SigLIP for text-image embeddings:

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("img.jpg")
inputs = processor(images=image, return_tensors="pt")
features = model.get_image_features(**inputs)  # one embedding row per input image

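Use the embeddings for image search: embed the query the same way (get_text_features for a text query) and rank stored images by cosine similarity.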
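
The ranking step can be sketched in plain Python. The three-dimensional vectors and file names below are made-up stand-ins for real CLIP embeddings, and cosine and search are hypothetical helper names; at scale you would use NumPy or a vector index instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: list[float], index: dict[str, list[float]], top_k: int = 3) -> list[str]:
    """Return the names of the top_k stored vectors most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:top_k]


# Toy index: file name -> embedding (illustrative values only)
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
print(search([1.0, 0.0, 0.0], index, top_k=2))  # ['cat.jpg', 'dog.jpg']
```
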

Common mistakes

  • Sending huge images (cost + slow).
  • No image preprocessing (orientation, contrast).
  • Interleaving images and text so that it is unclear which instruction applies to which image.
  • Forgetting that LLMs hallucinate on images too.
  • Treating OCR results as authoritative without validation.

Read this next

If you want my multimodal helpers, they’re at rajpoot.dev.

