Multimodal LLMs cheatsheet.
OpenAI vision
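from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/img.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)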
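Base64 (for local files, send a data URL instead of a link):
import base64
with open("img.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()
# Use this part in place of the "image_url" part in the request
image_part = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}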
Anthropic vision
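import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # img_b64 is the base64 string of the image file (see the Base64 snippet)
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": img_b64}},
            {"type": "text", "text": "Describe this image"},
        ],
    }],
)
print(response.content[0].text)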
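The two providers want the same base64 payload wrapped in different shapes, so a small helper can cut down on copy-paste mistakes. This is a minimal sketch using only the standard library. The function names (encode_image, openai_image_part, anthropic_image_part) are hypothetical, and guessing the media type from the file name is an assumption that only holds for correctly named files.

```python
import base64
import mimetypes


def encode_image(raw: bytes, filename: str):
    """Return (media_type, base64_text) for raw image bytes."""
    # Guess the media type from the file extension only; the bytes are not inspected.
    media_type, _ = mimetypes.guess_type(filename)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"cannot infer an image media type from {filename!r}")
    return media_type, base64.b64encode(raw).decode("ascii")


def openai_image_part(raw: bytes, filename: str) -> dict:
    """Content part in the data-URL shape used by the OpenAI snippet."""
    media_type, data = encode_image(raw, filename)
    return {"type": "image_url", "image_url": {"url": f"data:{media_type};base64,{data}"}}


def anthropic_image_part(raw: bytes, filename: str) -> dict:
    """Content block in the base64 source shape used by the Anthropic snippet."""
    media_type, data = encode_image(raw, filename)
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}
```

Either part can be dropped straight into the "content" list of the corresponding request.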
Use cases
- OCR / text extraction.
- Image description / captioning.
- Diagram understanding.
- UI element detection.
- Form parsing.
- Document layout analysis.
Costs
Image input is billed in tokens, and the token count grows with the image's pixel dimensions. Resize to roughly 1000-1500 px on the longest side for typical use.
from PIL import Image
img = Image.open("big.jpg")
img.thumbnail((1500, 1500))
img.save("small.jpg", quality=85)
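To get a feel for what resizing saves, the arithmetic can be done without opening any image. The sketch below computes the aspect-preserving size that a 1500 px cap gives and a rough token estimate. The width x height / 750 figure is the rule of thumb Anthropic documents for Claude. Treat it as an assumption, since other providers count differently and rounding may differ slightly from what Pillow produces. Both function names are hypothetical.

```python
def fit_within(width: int, height: int, max_side: int = 1500):
    """Aspect-preserving size whose longest side is at most max_side (never upscales)."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def approx_image_tokens(width: int, height: int, pixels_per_token: int = 750) -> int:
    """Rough token estimate: one token per ~750 pixels (assumed rule of thumb)."""
    return (width * height) // pixels_per_token


# A 4000x3000 photo shrinks to 1500x1125 under a 1500 px cap.
new_w, new_h = fit_within(4000, 3000)
estimate = approx_image_tokens(new_w, new_h)
```

The estimate is only useful for comparing sizes against each other; the provider's usage report is the authoritative number.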
Audio (Whisper)
with open("audio.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        language="en",  # optional ISO-639-1 hint
    )
print(transcript.text)
Whisper is also available as open weights; faster-whisper and WhisperX are common ways to run it locally.
Audio in chat models
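OpenAI and Gemini chat models accept audio input. The content part below uses the OpenAI chat format, where b64_audio is the base64-encoded audio file:
"content": [
    {"type": "input_audio", "input_audio": {"data": b64_audio, "format": "wav"}},
    {"type": "text", "text": "Transcribe and summarize"},
],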
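The b64_audio value is just the audio file's bytes, base64-encoded to text. A minimal sketch of building that content part follows. The helper name audio_part is hypothetical, and the accepted format set ("wav", "mp3") is an assumption to check against the provider's current documentation.

```python
import base64

# Assumed set of accepted formats; verify against the provider docs.
SUPPORTED_FORMATS = {"wav", "mp3"}


def audio_part(raw: bytes, fmt: str) -> dict:
    """Wrap raw audio bytes in an input_audio content part."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {fmt!r}")
    b64_audio = base64.b64encode(raw).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": b64_audio, "format": fmt}}
```

Typical use would be audio_part(open("clip.wav", "rb").read(), "wav"), placed in the "content" list next to the text part.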
TTS
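# Stream the audio to disk; calling stream_to_file on a plain create() response
# is deprecated in recent SDK versions.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello",
) as response:
    response.stream_to_file("out.mp3")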
Voices include alloy, echo, fable, onyx, nova, and shimmer.
Video
Most LLMs have no native video input; the usual workaround is to send sampled frames as images alongside the text prompt. Gemini can process video files natively.
# Gemini
import google.generativeai as genai

with open("video.mp4", "rb") as f:  # placeholder file name
    video_bytes = f.read()

model = genai.GenerativeModel("gemini-2.5")
result = model.generate_content([
    "Describe what happens in this video",
    {"mime_type": "video/mp4", "data": video_bytes},
])
print(result.text)
For other LLMs, extract a frame every N seconds and pass the frames as images.
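The sampling step comes down to picking which frame indices to decode. Here is a minimal sketch of that selection, leaving the decoding itself to a video library such as OpenCV or ffmpeg. The function name and parameters are hypothetical, and the frame count and frame rate are assumed to come from the video's metadata.

```python
def sample_frame_indices(total_frames: int, fps: float, every_seconds: float):
    """Indices of the frames to keep when sampling one frame every N seconds."""
    if total_frames < 0 or fps <= 0 or every_seconds <= 0:
        raise ValueError("total_frames must be >= 0; fps and every_seconds must be > 0")
    # Number of frames to skip between samples, at least 1.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled every 2 seconds.
indices = sample_frame_indices(total_frames=300, fps=30.0, every_seconds=2.0)
```

Each selected frame would then be encoded as JPEG or PNG and sent like any other image. Longer intervals keep the request small at the price of missing short events.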
Image generation
# DALL-E
response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain landscape",
    size="1024x1024",
    quality="hd",
    n=1,
)
# Stable Diffusion (local)
from diffusers import StableDiffusionXLPipeline  # the SDXL checkpoint needs the XL pipeline class
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
image = pipe("a cat").images[0]
image.save("cat.png")
Other options for high-quality generation: Flux models, Midjourney, and hosted-model platforms such as Replicate.
Image editing
with open("image.png", "rb") as image, open("mask.png", "rb") as mask:
    response = client.images.edit(
        image=image,
        mask=mask,  # transparent pixels mark the area to repaint
        prompt="Add a sunset",
        n=1,
        size="1024x1024",
    )
OCR
Multimodal LLMs do OCR well. For high-volume workloads, a dedicated OCR engine (Tesseract, EasyOCR, PaddleOCR) is usually much cheaper.
import easyocr
reader = easyocr.Reader(['en'])
results = reader.readtext('image.jpg')  # list of (bounding box, text, confidence)
Document understanding
Render each PDF page to an image:
import fitz # PyMuPDF
doc = fitz.open("doc.pdf")
images = [page.get_pixmap(dpi=150).tobytes("png") for page in doc]
# Pass each as image to vision LLM
Or use a specialized service: AWS Textract, Azure Document Intelligence, Google Document AI.
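When several rendered pages go into one request, it helps to label each page so the model can say which page an answer came from. The sketch below turns a list of PNG bytes (like the images list from PyMuPDF) into Anthropic-style content blocks. The "Page N:" labels and the function name are my own assumptions for illustration, not something the APIs require.

```python
import base64


def labeled_page_content(page_pngs, question: str, first_page: int = 1):
    """Build a content list: a 'Page N:' label before each page image, then the question."""
    content = []
    for offset, png in enumerate(page_pngs):
        content.append({"type": "text", "text": f"Page {first_page + offset}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(png).decode("ascii"),
            },
        })
    # Put the instruction last so it follows all the pages it refers to.
    content.append({"type": "text", "text": question})
    return content
```

The returned list can be used as the "content" of a single user message. For long documents, sending a few pages per request keeps each call within image-count and size limits.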
Embeddings for images
CLIP and SigLIP embed text and images into a shared vector space:
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
features = model.get_image_features(**inputs)  # one embedding row per image
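Use the embeddings for image search: embed the query text with get_text_features and rank the images by cosine similarity.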
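Once the images are embedded, search means ranking stored vectors by cosine similarity to a query vector. The sketch below shows that ranking in plain Python with toy 2-D vectors standing in for real CLIP features. The function names and the dict-based index are hypothetical, and a real system would use NumPy or a vector database instead.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, index, k: int = 3):
    """Return the k (name, score) pairs most similar to query_vec, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: file name -> embedding vector.
index = {"cat.jpg": [1.0, 0.0], "dog.jpg": [0.0, 1.0], "kitten.jpg": [0.9, 0.1]}
hits = search([1.0, 0.0], index, k=2)
```

With real embeddings, the query vector would come from the text encoder and the index vectors from the image encoder of the same model.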
Common mistakes
- Sending huge images (higher cost, slower responses).
- Skipping image preprocessing (orientation, contrast).
- Interleaving images and text so it is unclear which text refers to which image.
- Forgetting that LLMs hallucinate on images too.
- Treating OCR results as authoritative without validation.
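For the last point, validation can be as simple as dropping low-confidence reads and checking that extracted fields have the expected shape. The sketch below assumes OCR results as (bounding box, text, confidence) tuples, the shape EasyOCR returns. The 0.8 threshold, the amount pattern, and the function names are illustrative assumptions.

```python
import re
from decimal import Decimal

# Matches amounts like 12.50, 1234.50, 1,234.50 or $1,234.50 (assumed format).
AMOUNT_RE = re.compile(r"^\$?(\d{1,3}(?:,\d{3})*|\d+)\.\d{2}$")


def confident_texts(ocr_results, min_conf: float = 0.8):
    """Keep only the text of results whose confidence is at least min_conf."""
    return [text for _bbox, text, conf in ocr_results if conf >= min_conf]


def parse_amount(text: str):
    """Return the amount as a Decimal, or None if the text is not a well-formed amount."""
    cleaned = text.strip()
    if not AMOUNT_RE.match(cleaned):
        return None
    return Decimal(cleaned.lstrip("$").replace(",", ""))


# Made-up results for illustration; bounding boxes omitted.
results = [(None, "$1,234.50", 0.97), (None, "S1,2E4.5O", 0.41), (None, "Total", 0.93)]
amounts = [a for a in map(parse_amount, confident_texts(results)) if a is not None]
```

Fields that fail validation are better routed to a second pass or human review than silently accepted.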
Read this next
If you want my multimodal helpers, they’re at rajpoot.dev.