AI/LLM Cheatsheet 17 — Observability for LLMs

What to log, which tools to use, and how to monitor LLM calls in production.

What to track

Request: model, prompt, params, user.
Response: text, tool calls, finish reason.
Latency: time to first token, total.
Tokens: input, output, cached.
Cost: derived from tokens.
Errors: rate limits, timeouts.
User feedback: 👍/👎.
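
The fields above can be gathered into one record per call. This is a minimal sketch, assuming a hypothetical `LLMCallRecord` dataclass; the field names are illustrative and not tied to any vendor schema. Cost is left out because it is derived from the token counts.

```python
from dataclasses import dataclass, asdict, field
from typing import Optional
import json


@dataclass
class LLMCallRecord:
    """One row per LLM call; all field names are illustrative assumptions."""
    # Request
    model: str
    prompt: str
    params: dict
    user_id: str
    # Response
    text: str = ""
    tool_calls: list = field(default_factory=list)
    finish_reason: Optional[str] = None
    # Latency
    ttft_ms: Optional[int] = None      # time to first token
    total_ms: Optional[int] = None
    # Tokens
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    # Errors and feedback
    error: Optional[str] = None        # e.g. "rate_limit", "timeout"
    feedback: Optional[int] = None     # 1 for thumbs up, -1 for thumbs down

    def to_json(self) -> str:
        # Serialize to a single JSON line for a log pipeline
        return json.dumps(asdict(self), ensure_ascii=False)


record = LLMCallRecord(
    model="gpt-5", prompt="Hi", params={"temperature": 0.2}, user_id="u1",
    text="Hello!", finish_reason="stop", total_ms=840,
    input_tokens=3, output_tokens=2,
)
print(record.to_json())
```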

Tools

LangSmith: traces, eval, replays.
Helicone: proxy, logs, dashboards.
Phoenix (Arize): open-source observability.
OpenLLMetry: OpenTelemetry for LLMs.
Langfuse: open-source.

LangSmith

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=...

Captures all LangChain calls automatically.

Helicone

Proxy URL change:

from openai import OpenAI

client = OpenAI(
    api_key=...,
    base_url="https://oai.helicone.ai/v1",  # current Helicone proxy host (hconeai.com is the legacy domain)
    default_headers={"Helicone-Auth": f"Bearer {HELICONE_KEY}"},
)

Logs every request with no code changes beyond the base URL and auth header.

OpenTelemetry (OpenLLMetry)

from traceloop.sdk import Traceloop
Traceloop.init(app_name="myapp")

# All openai/anthropic calls auto-traced

Ship spans to Jaeger / Honeycomb / Tempo / DataDog.

Custom logging

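import time

import structlog
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
log = structlog.get_logger()

def chat(messages, model="gpt-5"):
    t0 = time.perf_counter()  # monotonic clock, safe for measuring durations
    try:
        response = client.chat.completions.create(model=model, messages=messages)
        log.info("llm_call",
            model=model,
            in_tokens=response.usage.prompt_tokens,
            out_tokens=response.usage.completion_tokens,
            latency_ms=int((time.perf_counter() - t0) * 1000),
            user_id=current_user_id(),  # app-specific helper, not defined here
            feature="chat",
        )
        return response
    except Exception as e:
        # Log latency on failures too: timeouts show up as long failed calls
        log.error("llm_call_failed", model=model, error=str(e),
            latency_ms=int((time.perf_counter() - t0) * 1000))
        raise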
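
The wrapper above records total latency only; time to first token needs a streaming call. This is a minimal sketch of the measurement, assuming the stream can be treated as an iterable of text chunks. `consume_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a provider's streaming response.

```python
import time
from typing import Iterable, Optional, Tuple


def consume_stream(chunks: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Join streamed text chunks and return (text, ttft_ms, total_ms)."""
    t0 = time.perf_counter()
    ttft_ms = None
    parts = []
    for chunk in chunks:
        if ttft_ms is None and chunk:
            # The first non-empty chunk marks time to first token
            ttft_ms = (time.perf_counter() - t0) * 1000
        parts.append(chunk)
    total_ms = (time.perf_counter() - t0) * 1000
    return "".join(parts), ttft_ms, total_ms


def fake_stream():
    # Stand-in for a streaming response: two chunks with artificial delays
    time.sleep(0.02)
    yield "Hel"
    time.sleep(0.02)
    yield "lo"


text, ttft_ms, total_ms = consume_stream(fake_stream())
```

Logging both values separates queueing and prefill delay (TTFT) from generation time (total minus TTFT).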

Trace IDs (correlate)

import uuid
trace_id = str(uuid.uuid4())

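# Bind to every log line; also send it with the LLM request (for example as a
# request header or metadata field, if the provider supports one) so both sides correlate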
log = log.bind(trace_id=trace_id)
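
When a request passes through several functions, threading `trace_id` through every call gets tedious. One option is a context variable, sketched here with the standard `logging` module instead of structlog to keep it dependency-free. `trace_id_var`, `TraceIdFilter`, and `handle_request` are hypothetical names.

```python
import contextvars
import logging
import uuid

# Holds the trace ID for the current request; "-" means no active request
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


logger = logging.getLogger("llm")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def handle_request(question: str) -> str:
    trace_id = str(uuid.uuid4())
    token = trace_id_var.set(trace_id)
    try:
        # Both lines carry the same trace ID without passing it explicitly
        logger.info("retrieval started")
        logger.info("llm call started")
        return trace_id
    finally:
        # Restore the previous value so IDs never leak between requests
        trace_id_var.reset(token)
```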

User feedback

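@app.post("/feedback")
def feedback(message_id: str, rating: int):
    # Assumes a DB-API connection such as sqlite3: parameters go in one tuple
    db.execute("UPDATE messages SET rating = ? WHERE id = ?", (rating, message_id))
    db.commit()
    return {"ok": True}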

Stored with original request → use for fine-tuning / DPO later.
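
To make "stored with original request" concrete, here is a small sketch using an in-memory SQLite database. The `messages` schema, the `record_feedback` helper, and the 1 / -1 rating encoding are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE messages (
        id       TEXT PRIMARY KEY,
        prompt   TEXT NOT NULL,
        response TEXT NOT NULL,
        rating   INTEGER          -- NULL until the user votes; 1 = up, -1 = down
    )
""")
db.execute("INSERT INTO messages (id, prompt, response) VALUES (?, ?, ?)",
           ("m1", "What is RAG?", "Retrieval-augmented generation."))
db.execute("INSERT INTO messages (id, prompt, response) VALUES (?, ?, ?)",
           ("m2", "Define DPO.", "Direct preference optimization."))
db.commit()


def record_feedback(message_id: str, rating: int) -> bool:
    """Attach a rating to a stored message; returns False for unknown IDs."""
    if rating not in (1, -1):
        raise ValueError("rating must be 1 or -1")
    cur = db.execute("UPDATE messages SET rating = ? WHERE id = ?", (rating, message_id))
    db.commit()
    return cur.rowcount == 1


def rated_examples():
    """Prompt/response/rating rows, ready to export as preference data."""
    return db.execute(
        "SELECT prompt, response, rating FROM messages "
        "WHERE rating IS NOT NULL ORDER BY id"
    ).fetchall()
```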

Replay

Save inputs; replay against new prompt/model:

for case in production_log[-100:]:
    new_response = llm(case["prompt"], model="claude-opus-4-7")
    judge_compare(case["response"], new_response)
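
A slightly fuller version of the replay loop, which tallies how often the judge prefers the new response. This is a sketch under stated assumptions: `replay`, `stub_candidate`, and `stub_judge` are hypothetical, and the stubs stand in for a real model call and a real LLM judge.

```python
from collections import Counter
from typing import Callable, Dict, List


def replay(cases: List[Dict[str, str]],
           candidate: Callable[[str], str],
           judge: Callable[[str, str, str], str]) -> Counter:
    """Re-run logged prompts through `candidate` and count judge verdicts."""
    verdicts = Counter()
    for case in cases:
        new_response = candidate(case["prompt"])
        # The judge returns "old", "new", or "tie"
        verdicts[judge(case["prompt"], case["response"], new_response)] += 1
    return verdicts


def stub_candidate(prompt: str) -> str:
    # Stand-in for calling the new prompt/model
    return prompt + "!"


def stub_judge(prompt: str, old: str, new: str) -> str:
    # Stand-in for an LLM judge: simply prefers the longer response
    if len(new) > len(old):
        return "new"
    if len(new) < len(old):
        return "old"
    return "tie"


cases = [
    {"prompt": "hi", "response": "h"},
    {"prompt": "yo", "response": "yoyo"},
    {"prompt": "ab", "response": "abc"},
]
result = replay(cases, stub_candidate, stub_judge)
```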

A/B in prod

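import hashlib

# Built-in hash() is salted per process for strings, so buckets would change between restarts
variant = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 2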
prompt = PROMPT_A if variant == 0 else PROMPT_B
response = llm(prompt, ...)
log.info("ab", variant=variant, ...)

Track outcome metrics per variant.
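
One way to track outcomes per variant, sketched with thumbs-up rate as the metric. `assign_variant`, `summarize`, and the experiment name are hypothetical. Salting the hash with the experiment name keeps bucket assignments independent across experiments.

```python
import hashlib
from collections import defaultdict


def assign_variant(user_id: str, experiment: str = "prompt_test", n_variants: int = 2) -> int:
    """Stable bucket: the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_variants


def summarize(events):
    """events: iterable of (variant, thumbs_up) pairs -> per-variant stats."""
    totals = defaultdict(lambda: {"n": 0, "up": 0})
    for variant, thumbs_up in events:
        totals[variant]["n"] += 1
        totals[variant]["up"] += int(thumbs_up)
    return {
        v: {"n": t["n"], "thumbs_up_rate": t["up"] / t["n"]}
        for v, t in sorted(totals.items())
    }


events = [(0, True), (0, False), (1, True)]
stats = summarize(events)
```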

Cost monitoring

# USD per token (rates quoted per 1K tokens); example values, check current provider pricing
COST = {
    "gpt-5": {"in": 0.005 / 1000, "out": 0.015 / 1000},
    "claude-opus-4-7": {"in": 0.015 / 1000, "out": 0.075 / 1000},
}

def estimate_cost(usage, model):
    return usage.prompt_tokens * COST[model]["in"] + usage.completion_tokens * COST[model]["out"]

Aggregate per user, per feature, per day.
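
The aggregation can be sketched as a group-by over logged calls. The model name `example-model`, its prices, and the record keys are placeholders chosen for illustration, not real rates.

```python
from collections import defaultdict

# Placeholder prices in USD per token; not real rates
PRICE = {"example-model": {"in": 1e-6, "out": 2e-6}}


def call_cost(call: dict) -> float:
    price = PRICE[call["model"]]
    return call["in_tokens"] * price["in"] + call["out_tokens"] * price["out"]


def aggregate(calls, key: str) -> dict:
    """Sum cost grouped by one dimension: "user", "feature", or "day"."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call_cost(call)
    return dict(totals)


calls = [
    {"user": "u1", "feature": "chat", "day": "2025-01-01",
     "model": "example-model", "in_tokens": 1000, "out_tokens": 500},
    {"user": "u1", "feature": "search", "day": "2025-01-01",
     "model": "example-model", "in_tokens": 2000, "out_tokens": 0},
    {"user": "u2", "feature": "chat", "day": "2025-01-02",
     "model": "example-model", "in_tokens": 0, "out_tokens": 1000},
]

by_user = aggregate(calls, "user")
by_feature = aggregate(calls, "feature")
by_day = aggregate(calls, "day")
```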

Alerts

Error rate > X%.
Latency p99 > Y seconds.
Cost spike > 2x baseline.
Failures concentrated on a specific user → check for jailbreak attempts.
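
The first three rules can be evaluated over a window of recent calls. This is a sketch: the default thresholds stand in for X and Y, and the record keys (`error`, `latency_s`, `cost`) are assumptions. The percentile uses the nearest-rank method.

```python
import math


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(window, baseline_cost: float,
                 max_error_rate: float = 0.05,
                 max_p99_s: float = 10.0,
                 cost_multiplier: float = 2.0) -> list:
    """Return the names of all alert rules that fire for this window."""
    alerts = []
    error_rate = sum(1 for c in window if c["error"]) / len(window)
    if error_rate > max_error_rate:
        alerts.append("error_rate")
    if percentile([c["latency_s"] for c in window], 99) > max_p99_s:
        alerts.append("latency_p99")
    if sum(c["cost"] for c in window) > cost_multiplier * baseline_cost:
        alerts.append("cost_spike")
    return alerts


healthy = [{"error": False, "latency_s": 1.0, "cost": 0.01} for _ in range(100)]
degraded = (
    [{"error": False, "latency_s": 1.0, "cost": 0.01} for _ in range(90)]
    + [{"error": True, "latency_s": 12.0, "cost": 0.01} for _ in range(10)]
)
```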

Sampling

For high-volume traffic, log full payloads for a 5-10% sample and keep counters (requests, tokens, errors) for every request.
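
A sketch of that split: counters are updated for every call, while full payloads are kept only for a deterministic sample keyed on the trace ID. `should_sample` and `record_call` are hypothetical names. Hashing the trace ID means every service makes the same keep-or-drop decision for a given trace.

```python
import zlib
from collections import Counter

counters = Counter()   # cheap aggregates, updated on every call
full_logs = []         # full payloads, sampled calls only


def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    # crc32 is stable across processes, unlike the built-in hash() for strings
    return zlib.crc32(trace_id.encode()) % 10_000 < rate * 10_000


def record_call(trace_id: str, prompt: str, response: str,
                in_tokens: int, out_tokens: int) -> None:
    counters["requests"] += 1
    counters["in_tokens"] += in_tokens
    counters["out_tokens"] += out_tokens
    if should_sample(trace_id):
        full_logs.append({"trace_id": trace_id, "prompt": prompt, "response": response})


for i in range(1000):
    record_call(f"trace-{i}", "prompt text", "response text", in_tokens=10, out_tokens=5)
```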

PII redaction in logs

import re

def redact(text):
    # Masks US Social Security numbers only; add patterns for other PII as needed
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    return text

log.info("llm_input", prompt=redact(prompt))
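
The same approach extends to a table of patterns. The email and phone expressions below are simplified assumptions for illustration; they will miss many real-world formats, so treat this as a starting point rather than a complete PII filter. `redact_all` is a hypothetical name.

```python
import re

# (pattern, replacement) pairs, applied in order
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),   # US-style numbers only
]


def redact_all(text: str) -> str:
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


sample = "SSN 123-45-6789, mail a.b@example.com, call 555-123-4567"
print(redact_all(sample))
```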

Eval in prod

Sample N% of responses; LLM-judge them; track quality drift.

import random

if random.random() < 0.05:  # judge roughly 5% of traffic
    judge_score = judge_llm(prompt, response)
    log.info("eval", score=judge_score)
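
Quality drift can be tracked by comparing a rolling mean of judge scores against a baseline. A minimal sketch, assuming numeric scores where higher is better; `DriftMonitor`, the window size, and the tolerance are illustrative choices.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags drift when the rolling mean falls below baseline - tolerance."""

    def __init__(self, baseline: float, window: int = 50, tolerance: float = 0.5):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> bool:
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough samples yet to judge drift
        return self.baseline - mean(self.scores) > self.tolerance


monitor = DriftMonitor(baseline=4.0, window=5, tolerance=0.5)
healthy = [monitor.add(4.0) for _ in range(5)]
degraded = [monitor.add(2.0) for _ in range(2)]
```

Waiting for a full window before alerting avoids firing on the first few low scores after a deploy.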

Slow query log

if latency_ms > 10000:
    # Redact before truncating so slow-call logs stay free of PII
    log.warning("slow_llm", prompt=redact(prompt)[:500], latency_ms=latency_ms)

Dashboards

Key panels:

Requests/sec.
p50/p95/p99 latency.
Token throughput.
Cost per hour.
Error rate.
User satisfaction (👍 / 👎).
Cache hit rate.

Failure modes to monitor

Rate limit hits.
Context window overflows.
Tool call errors.
Validation failures.
User reports.
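
Counting these failure modes is easier once each event is mapped to one category. A sketch, assuming errors have already been normalized into dicts; the keys (`status`, `code`, `tool_error`, `validation_error`, `user_reported`) and the `context_length_exceeded` value are hypothetical, since providers report these conditions differently. HTTP 429 is the standard "Too Many Requests" status.

```python
from collections import Counter


def classify_failure(event: dict) -> str:
    """Map a normalized failure event to one monitoring category."""
    if event.get("status") == 429:
        return "rate_limit"
    if event.get("code") == "context_length_exceeded":  # placeholder error code
        return "context_overflow"
    if event.get("tool_error"):
        return "tool_call_error"
    if event.get("validation_error"):
        return "validation_failure"
    if event.get("user_reported"):
        return "user_report"
    return "other"


events = [
    {"status": 429},
    {"status": 400, "code": "context_length_exceeded"},
    {"tool_error": "search tool timed out"},
    {"validation_error": "response was not valid JSON"},
    {"user_reported": True},
    {"status": 500},
    {"status": 429},
]
failure_counts = Counter(classify_failure(e) for e in events)
```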

Common mistakes

No logging → can’t debug.
Logging full prompts with PII.
No cost dashboard until bill arrives.
No feedback loop from users.
Treating uptime % as a quality measure (it isn't: a service can be up and still return bad answers).
