LLM basics cheatsheet.

Providers

  • Anthropic: Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5).
  • OpenAI: GPT-5, GPT-4o, o-series reasoning models.
  • Google: Gemini 2.5 / 3.
  • Meta: Llama 4 (open).
  • Mistral: Mistral Large, Codestral.
  • DeepSeek: DeepSeek-V3, DeepSeek-R1 (cost-effective).

OpenAI API

import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ],
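    temperature=0.7,   # some reasoning models accept only the default temperature
    max_tokens=1000,   # newer reasoning models expect max_completion_tokens instead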
)

print(response.choices[0].message.content)
print(response.usage)

Anthropic API

import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system="You are helpful.",
    messages=[
        {"role": "user", "content": "Hi"},
    ],
)

print(response.content[0].text)
print(response.usage)

Streaming

# OpenAI
stream = client.chat.completions.create(model="gpt-5", messages=msgs, stream=True)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

# Anthropic
with client.messages.stream(model="claude-opus-4-7", max_tokens=1024, messages=msgs) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Tokens

LLM APIs bill per token, for both input and output.

# OpenAI
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens))    # 4

# Anthropic
client.messages.count_tokens(...)    # takes model= and messages=; the result has .input_tokens

Rule of thumb: 1 token ≈ 4 characters of English text, or ≈ 0.75 words.
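
When no tokenizer is at hand, the rule of thumb above can be turned into a rough estimator for quick budget checks. This is only a sketch: `estimate_tokens` and `estimate_tokens_from_words` are hypothetical helpers, and real counts vary by tokenizer and language.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate from the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / 4)

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from the ~0.75 words per token rule of thumb."""
    return math.ceil(word_count / 0.75)

print(estimate_tokens("Hello, world!"))    # 13 characters -> 4
print(estimate_tokens_from_words(750))     # 750 words -> 1000
```

Use a real tokenizer (or the provider's token-counting endpoint) whenever the number feeds into billing or context-limit decisions.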

Pricing (early 2026 ballpark)

Model               Input   Output   Context
Claude Haiku 4.5    $1      $5       200k
Claude Sonnet 4.6   $3      $15      200k+
Claude Opus 4.7     $15     $75      1M
GPT-5               varies  varies   256k
GPT-4o              $5      $15      128k

Prices are in USD per million tokens. Use cheap models for high-volume work and premium models for complex reasoning.
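
To see what per-million pricing means for a real workload, the arithmetic can be sketched as below. The prices are copied from the ballpark table above, not fetched live, and the dictionary keys are illustrative labels rather than API model IDs; `estimate_cost` is a hypothetical helper.

```python
# USD per million tokens, copied from the ballpark table above (not live prices).
PRICES = {
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "claude-opus-4.7": {"input": 15.00, "output": 75.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 2,000 input tokens and 500 output tokens per request, 10,000 requests:
per_request = estimate_cost("claude-sonnet-4.6", 2_000, 500)
print(f"${per_request:.4f} per request, ${per_request * 10_000:.2f} per 10k requests")
# $0.0135 per request, $135.00 per 10k requests
```

Output tokens dominate here even though there are fewer of them, which is why trimming response length often saves more than trimming the prompt.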

Parameters

temperature=0.7      # 0 = near-deterministic; 2 = very random
top_p=0.9            # nucleus sampling
max_tokens=1000      # output cap
stop=["\n\n"]        # stop sequences
seed=42              # reproducibility (best-effort)

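Default: temperature=1.0. For factual answers and code, use 0-0.3; for creative writing, 0.7-1.0. The names above follow the OpenAI API, which accepts temperatures from 0 to 2. Anthropic caps temperature at 1.0, calls stop sequences stop_sequences, and has no seed parameter.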

System / user / assistant

messages = [
    {"role": "system", "content": "You output only JSON."},
    {"role": "user", "content": "List 3 colors."},
    {"role": "assistant", "content": '["red", "blue", "green"]'},
    {"role": "user", "content": "Now add yellow."},
]

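The system message sets overall behavior. Alternating user/assistant messages replay the conversation so far, and can also be written by hand to show the model an example exchange.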

Multi-turn

history = []

def chat(user_msg):
    history.append({"role": "user", "content": user_msg})
    response = client.chat.completions.create(model="gpt-5", messages=history)
    msg = response.choices[0].message.content
    history.append({"role": "assistant", "content": msg})
    return msg

Watch context window growth: every call resends the full history, so truncate or summarize older turns.
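
One simple way to truncate is to keep the system message and only the most recent messages. The sketch below assumes the message-dict format used above; `trim_history` and its `max_messages` parameter are hypothetical, and counting messages is a crude stand-in for a real token budget.

```python
def trim_history(history, max_messages=10):
    """Keep a leading system message (if any) plus the last `max_messages` messages."""
    system = [m for m in history[:1] if m["role"] == "system"]
    rest = history[len(system):]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent

history = [{"role": "system", "content": "You are helpful."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print(len(trimmed))             # 5: the system message plus the last 4 messages
print(trimmed[1]["content"])    # question 18
```

Keep `max_messages` even so the trimmed history still starts with a user turn. Summarizing the dropped turns into one short message preserves more context at a small token cost.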

Error handling

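from openai import APIError, RateLimitError
import time

def call_with_retry(messages, retries=3):
    for i in range(retries):
        try:
            return client.chat.completions.create(model="gpt-5", messages=messages)
        except RateLimitError:
            if i == retries - 1:
                raise              # out of attempts: surface the error, don't return None
            time.sleep(2 ** i)     # exponential backoff: 1s, 2s, 4s, ...
        except APIError:
            if i == retries - 1:
                raise
            time.sleep(1)          # other API errors: short fixed pause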
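
The retry loop above sleeps for fixed exponential delays. When many clients retry at once, adding random jitter spreads them out. The following is a provider-agnostic sketch of that idea, not the SDK's built-in retry logic; `backoff_delay` and `retry` are hypothetical helpers, and `flaky` simulates a transient failure.

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def retry(fn, retries=3, base=1.0, cap=30.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn() up to `retries` times, sleeping with jittered backoff between tries."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise                      # out of attempts: re-raise the last error
            sleep(backoff_delay(attempt, base, cap))

# Simulated call that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"

print(retry(flaky, retries=5, base=0.01, retry_on=(TimeoutError,)))   # ok
```

In real use, pass a lambda that wraps the API call and set `retry_on` to the SDK's retryable exception classes.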

Structured outputs (JSON mode)

# OpenAI
response = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    response_format={"type": "json_object"},
)

# Or with a schema (older SDK versions expose this as client.beta.chat.completions.parse)
response = client.chat.completions.parse(
    model="gpt-5",
    messages=[...],
    response_format=MyPydanticModel,
)

Anthropic supports JSON via prompt + tool use (see tool use cheatsheet).

Async

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def chat(msg):
    r = await client.chat.completions.create(
        model="gpt-5", messages=[{"role": "user", "content": msg}]
    )
    return r.choices[0].message.content
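
The main payoff of the async client is running many requests concurrently. A minimal sketch of that pattern follows, with a semaphore to cap the number of in-flight calls. `fake_chat` is a stand-in that sleeps instead of calling an API, and `chat_many` is a hypothetical helper; swap in the real `chat` coroutine above as `chat_fn`.

```python
import asyncio

async def fake_chat(msg: str) -> str:
    """Stand-in for a real chat coroutine; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return f"echo: {msg}"

async def chat_many(msgs, chat_fn=fake_chat, max_concurrency=5):
    """Run chat_fn over msgs concurrently, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(msg):
        async with sem:
            return await chat_fn(msg)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(m) for m in msgs))

results = asyncio.run(chat_many(["a", "b", "c"]))
print(results)   # ['echo: a', 'echo: b', 'echo: c']
```

Capping concurrency keeps a burst of requests from tripping the provider's rate limits.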

Cost optimization

  1. Cache prompts (Anthropic prompt caching, OpenAI cached input).
  2. Use smaller models when possible.
  3. Cut output length.
  4. Batch requests (Anthropic batch, OpenAI batch).
  5. Stream to start UX sooner (not cheaper, but feels faster).

Prompt caching (Anthropic)

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Static long context", "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Q"}],
)

Subsequent calls within the 5-minute TTL reuse the cached prefix, and the cached input tokens are billed at roughly 10% of the normal input price.
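
To get a feel for the savings, the arithmetic can be sketched as follows. The 10% read rate comes from the note above; the 1.25x multiplier for the first call that writes the cache is an assumption to check against current pricing, and both function names are hypothetical.

```python
def uncached_cost(prefix_tokens, calls, input_price_per_m):
    """USD cost of sending the same prefix on every call with no caching."""
    return prefix_tokens * calls * input_price_per_m / 1_000_000

def cached_cost(prefix_tokens, calls, input_price_per_m,
                write_mult=1.25, read_mult=0.10):
    """USD cost when the first call writes the cache and later calls read it.

    write_mult is an assumed premium on the cache-writing call;
    read_mult is the ~10% rate for cached reads.
    """
    if calls <= 0:
        return 0.0
    per_token = input_price_per_m / 1_000_000
    first = prefix_tokens * per_token * write_mult
    rest = prefix_tokens * per_token * read_mult * (calls - 1)
    return first + rest

# 50k-token prefix, 20 calls inside the TTL, $3 per million input tokens
print(round(uncached_cost(50_000, 20, 3.0), 4))   # 3.0
print(round(cached_cost(50_000, 20, 3.0), 4))     # 0.4725
```

The estimate covers only the shared prefix. The per-call question and the output tokens are billed as usual.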

Free / local

  • Ollama: run Llama 4, Qwen, etc. locally.
  • LM Studio: GUI.
  • vLLM / llama.cpp: production self-hosted inference.

ollama pull llama4:scout
ollama run llama4:scout "Hello"

Common mistakes

  • Hardcoding API keys.
  • No retries / rate limit handling.
  • Sending PII to third-party APIs without consent.
  • Treating LLM output as authoritative without validation.
  • Ignoring token cost; running up bills.
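
As an illustration of validating output before trusting it, the sketch below checks that a reply is the JSON array of strings that was asked for, as in the colors example earlier. `parse_color_list` is a hypothetical helper; a schema library such as Pydantic is the more scalable choice for larger structures.

```python
import json

def parse_color_list(raw: str) -> list[str]:
    """Parse model output that should be a JSON array of strings.

    Raises ValueError if the text is not valid JSON or has the wrong shape.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError("expected a JSON array of strings")
    return data

print(parse_color_list('["red", "blue", "green"]'))   # ['red', 'blue', 'green']

try:
    parse_color_list("Sure! Here are three colors: red, blue, green.")
except ValueError as err:
    print("rejected:", err)
```

On a rejection, a common follow-up is to retry the request once with the error message appended, then fail loudly.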

Read this next

If you want my LLM client wrapper (retry + cache + log), it's at rajpoot.dev.

