LLM basics cheatsheet.
Providers
- Anthropic: Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5).
- OpenAI: GPT-5, GPT-4o, o-series reasoning models.
- Google: Gemini 2.5 / 3.
- Meta: Llama 4 (open).
- Mistral: Mistral Large, Codestral.
- DeepSeek: cost-effective.
OpenAI API
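import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ],
    # Reasoning models (GPT-5, o-series) take max_completion_tokens and reject a
    # custom temperature; on GPT-4o use temperature=0.7 and max_tokens=1000 instead.
    max_completion_tokens=1000,
)
print(response.choices[0].message.content)  # reply text
print(response.usage)  # prompt, completion, and total token counts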
Anthropic API
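import os
from anthropic import Anthropic
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required on every Anthropic request
    system="You are helpful.",  # system prompt is a top-level argument, not a message
    messages=[
        {"role": "user", "content": "Hi"},
    ],
)
print(response.content[0].text)  # first content block holds the reply text
print(response.usage)  # input and output token counts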
Streaming
# OpenAI (client is the OpenAI client; msgs is a list of message dicts)
stream = client.chat.completions.create(model="gpt-5", messages=msgs, stream=True)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
# Anthropic (client is the Anthropic client)
with client.messages.stream(model="claude-opus-4-7", max_tokens=1024, messages=msgs) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Tokens
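LLM APIs bill per token, counting input and output separately.
# OpenAI: count locally with tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens)) # 4
# Anthropic: count server-side through the API
count = client.messages.count_tokens(model="claude-opus-4-7", messages=[{"role": "user", "content": "Hello, world!"}])
print(count.input_tokens)
Rule of thumb: 1 token ≈ 4 characters of English text, or about 0.75 words.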
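When no tokenizer is at hand, the rule of thumb can be turned into a quick estimator. This is a rough sketch only: `estimate_tokens` is a hypothetical helper, and the 4-characters-per-token ratio holds for English prose, not for code or other languages.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English text only)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

print(estimate_tokens("Hello, world!"))  # 13 characters -> 4
```

Use a real tokenizer before relying on the number for billing or context limits.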
Pricing (early 2026 ballpark)
| Model | Input | Output | Context |
|---|---|---|---|
| Claude Haiku 4.5 | $1 | $5 | 200k |
| Claude Sonnet 4.6 | $3 | $15 | 200k+ |
| Claude Opus 4.7 | $15 | $75 | 1M |
| GPT-5 | varies | varies | 256k |
| GPT-4o | $2.50 | $10 | 128k |
Prices are in USD per million tokens. Use cheap models for high-volume work and premium models for complex reasoning.
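To turn per-million prices into a per-request cost, multiply each token count by its price and divide by one million. A minimal sketch, with `request_cost` as a hypothetical helper and the token counts made up for illustration:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 10,000 input + 2,000 output tokens at $3 input / $15 output per million
cost = request_cost(10_000, 2_000, input_price=3.0, output_price=15.0)
print(f"${cost:.4f}")  # $0.0600
```

Output tokens cost several times more than input tokens, so long replies tend to dominate the bill.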
Parameters
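temperature=0.7 # 0 = near-deterministic; higher = more random (OpenAI allows up to 2, Anthropic up to 1)
top_p=0.9 # nucleus sampling: sample only from the most likely tokens covering 90% of probability
max_tokens=1000 # output cap
stop=["\n\n"] # stop sequences (Anthropic names this stop_sequences)
seed=42 # reproducibility (best-effort, OpenAI only)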
Default: temperature=1.0. For factual/code: 0-0.3. For creative: 0.7-1.0.
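What temperature and top_p do to the next-token distribution can be shown without any API. The sketch below is illustrative only: the logits are made up, and real providers may implement sampling differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9))
```

At low temperature nearly all probability lands on the top token; at high temperature the distribution flattens, which is why factual and code tasks favor the low end.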
System / user / assistant
messages = [
{"role": "system", "content": "You output only JSON."},
{"role": "user", "content": "List 3 colors."},
{"role": "assistant", "content": '["red", "blue", "green"]'},
{"role": "user", "content": "Now add yellow."},
]
The system message sets overall behavior. The alternating user/assistant messages replay the conversation so far, so the model sees its own earlier replies.
Multi-turn
history = []
def chat(user_msg):
    history.append({"role": "user", "content": user_msg})
    response = client.chat.completions.create(model="gpt-5", messages=history)
    msg = response.choices[0].message.content
    history.append({"role": "assistant", "content": msg})
    return msg
Watch context-window growth: each call resends the whole history, so truncate or summarize older turns.
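One way to truncate is to keep the system message plus as many of the newest turns as fit a token budget. This is a sketch under assumptions: `trim_history` and `estimate_tokens` are hypothetical helpers, and the token count is the rough 4-characters-per-token estimate rather than a real tokenizer.

```python
def estimate_tokens(text):
    """Rough estimate: about 4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(history, budget):
    """Keep system messages plus the newest turns that fit within `budget` tokens."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []  # collected newest-first
    for msg in reversed(rest):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    # Drop a dangling assistant reply so the kept turns start with a user message.
    while kept and kept[-1]["role"] != "user":
        kept.pop()
    return system + kept[::-1]  # restore chronological order

history = [
    {"role": "system", "content": "You output only JSON."},
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 400},
    {"role": "user", "content": "Now add yellow."},
]
print([m["role"] for m in trim_history(history, budget=120)])  # ['system', 'user']
```

Truncation loses information; summarizing the dropped turns into one short message is the usual alternative when older context still matters.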
Error handling
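from openai import APIError, RateLimitError
import time
def call_with_retry(messages, retries=3):
    for i in range(retries):
        try:
            return client.chat.completions.create(model="gpt-5", messages=messages)
        except RateLimitError:
            # RateLimitError subclasses APIError, so it must be caught first
            if i == retries - 1:
                raise  # out of retries: don't fall through and return None
            time.sleep(2 ** i)  # exponential backoff: 1s, 2s, 4s, ...
        except APIError:
            if i == retries - 1:
                raise
            time.sleep(1)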
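The same pattern can be written provider-agnostically, with jitter added so that many clients do not retry in lockstep. A minimal sketch: `retry_with_backoff` and `flaky` are hypothetical names, and the injectable `sleep` argument exists only to make the example run instantly.

```python
import random
import time

def retry_with_backoff(fn, retries=3, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries

# Usage with a stand-in for an API call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"

print(retry_with_backoff(flaky, retry_on=(TimeoutError,), sleep=lambda s: None))  # ok
```

Only retry errors that are plausibly transient (rate limits, timeouts, 5xx); retrying an invalid request just burns quota.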
Structured outputs (JSON mode)
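# OpenAI: JSON mode (the prompt itself must mention JSON)
response = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    response_format={"type": "json_object"},
)
# Or with a schema: pass a Pydantic model class
# (older SDK versions expose this as client.beta.chat.completions.parse)
response = client.chat.completions.parse(
    model="gpt-5",
    messages=[...],
    response_format=MyPydanticModel,
)
result = response.choices[0].message.parsed  # instance of MyPydanticModel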
Anthropic supports JSON via prompt + tool use (see tool use cheatsheet).
Async
from openai import AsyncOpenAI
client = AsyncOpenAI()
async def chat(msg):
    r = await client.chat.completions.create(
        model="gpt-5", messages=[{"role": "user", "content": msg}]
    )
    return r.choices[0].message.content
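Async pays off when many requests run concurrently, but unbounded fan-out tends to hit rate limits. The sketch below caps the number of in-flight calls with a semaphore; `fake_chat` is a stand-in for the `chat` coroutine so the example runs without a network, and `chat_many` is a hypothetical helper.

```python
import asyncio

async def fake_chat(msg):
    """Stand-in for the chat() coroutine above; makes no network call."""
    await asyncio.sleep(0.01)
    return f"echo: {msg}"

async def chat_many(prompts, limit=5, call=fake_chat):
    """Run `call` over all prompts with at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(prompt):
        async with sem:
            return await call(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(one(p) for p in prompts))

results = asyncio.run(chat_many(["a", "b", "c"], limit=2))
print(results)  # ['echo: a', 'echo: b', 'echo: c']
```

Pass `call=chat` to use the real client; a sensible `limit` depends on your account's rate limits.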
Cost optimization
- Cache prompts (Anthropic prompt caching, OpenAI cached input).
- Use smaller models when possible.
- Cut output length.
- Batch requests (Anthropic batch, OpenAI batch).
- Stream to start UX sooner (not cheaper, but feels faster).
Prompt caching (Anthropic)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        # cache_control marks the end of the prefix to cache
        {"type": "text", "text": "Static long context", "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Q"}],
)
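Subsequent calls within the 5-minute TTL reuse the cached prefix, and cached input tokens are billed at about 10% of the normal input price. The first call pays a small premium to write the cache, and very short prefixes are not cached.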
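A quick estimate shows how fast caching pays for itself. This sketch assumes a 1.25x price for the cache write and 0.10x for cache reads, and that every call lands inside the TTL; the helper names, prefix size, call count, and price are illustrative, so check the current pricing page before relying on them.

```python
def cached_prefix_cost(prefix_tokens, calls, input_price,
                       write_multiplier=1.25, read_multiplier=0.10):
    """USD cost of sending the same prefix `calls` times with caching.

    The first call writes the cache; later calls read it (assumed to be within the TTL).
    """
    per_token = input_price / 1_000_000
    write = prefix_tokens * per_token * write_multiplier
    reads = prefix_tokens * per_token * read_multiplier * (calls - 1)
    return write + reads

def uncached_prefix_cost(prefix_tokens, calls, input_price):
    """USD cost of resending the prefix at full price on every call."""
    return prefix_tokens * calls * input_price / 1_000_000

# 50,000-token prefix, 20 calls, $3 per million input tokens (illustrative numbers)
print(round(uncached_prefix_cost(50_000, 20, 3.0), 4))  # 3.0
print(round(cached_prefix_cost(50_000, 20, 3.0), 4))    # 0.4725
```

Under these assumptions caching is already cheaper on the second call, since one write plus one read costs 1.35x a single uncached send versus 2x.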
Free / local
- Ollama: run Llama 4, Qwen, etc. locally.
- LM Studio: GUI.
- vLLM / llama.cpp: production self-hosted inference.
ollama pull llama4:scout
ollama run llama4:scout "Hello"
Common mistakes
- Hardcoding API keys.
- No retries / rate limit handling.
- Sending PII to third-party APIs without consent.
- Treating LLM output as authoritative without validation.
- Ignoring token cost; running up bills.
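On validation: even with JSON mode, parse and check the reply before using it. A minimal sketch with the standard library; `parse_color_list` is a hypothetical helper for a reply that should be a JSON array of strings.

```python
import json

def parse_color_list(raw: str) -> list[str]:
    """Parse a model reply that should be a JSON array of strings; raise ValueError otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError("expected a JSON array of strings")
    return data

print(parse_color_list('["red", "blue", "green"]'))  # ['red', 'blue', 'green']
```

On a `ValueError`, a common choice is to retry the request once with the error message appended to the prompt, then fail loudly.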
Read this next
If you want my LLM client wrapper (retry + cache + log), it's at rajpoot.dev.