LLM observability cheatsheet.

What to track

  • Request: model, prompt, params, user.
  • Response: text, tool calls, finish reason.
  • Latency: time to first token, total.
  • Tokens: input, output, cached.
  • Cost: derived from tokens.
  • Errors: rate limits, timeouts.
  • User feedback: 👍/👎.
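
One way to keep these fields together is a single record per call. This is a minimal sketch, not a standard schema: the class name LLMCallRecord and every field name are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class LLMCallRecord:
    # Request
    model: str
    prompt: str
    params: dict
    user_id: str
    # Response
    text: str = ""
    tool_calls: list = field(default_factory=list)
    finish_reason: Optional[str] = None
    # Latency (milliseconds)
    ttft_ms: Optional[int] = None
    total_ms: Optional[int] = None
    # Tokens
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    # Cost, errors, feedback
    cost_usd: float = 0.0
    error: Optional[str] = None
    feedback: Optional[int] = None  # +1 for thumbs up, -1 for thumbs down

record = LLMCallRecord(
    model="gpt-5", prompt="Hi", params={"temperature": 0.2}, user_id="u1",
    text="Hello!", finish_reason="stop", total_ms=420,
    input_tokens=3, output_tokens=2,
)
row = asdict(record)  # plain dict, ready for a JSON log line or a DB insert
```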

Tools

  • LangSmith: traces, eval, replays.
  • Helicone: proxy, logs, dashboards.
  • Phoenix (Arize): open-source observability.
  • OpenLLMetry: OpenTelemetry for LLMs.
  • Langfuse: open-source.

LangSmith

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=...

Captures all LangChain calls automatically.

Helicone

Proxy URL change:

client = OpenAI(
    api_key=...,
    base_url="https://oai.hconeai.com/v1",
    default_headers={"Helicone-Auth": f"Bearer {HELICONE_KEY}"},
)

Logs every request; the base URL and auth header are the only code changes.

OpenTelemetry (OpenLLMetry)

from traceloop.sdk import Traceloop
Traceloop.init(app_name="myapp")

# All openai/anthropic calls auto-traced

Ship spans to Jaeger / Honeycomb / Tempo / DataDog.

Custom logging

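import time

import structlog
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
log = structlog.get_logger()

def chat(messages, model="gpt-5"):
    t0 = time.perf_counter()  # monotonic clock, safe for measuring durations
    try:
        response = client.chat.completions.create(model=model, messages=messages)
        log.info("llm_call",
            model=model,
            in_tokens=response.usage.prompt_tokens,
            out_tokens=response.usage.completion_tokens,
            latency_ms=int((time.perf_counter() - t0) * 1000),
            user_id=current_user_id(),  # app-specific helper, not shown
            feature="chat",
        )
        return response
    except Exception as e:
        # Log latency on failures too: timeouts show up as slow errors
        log.error("llm_call_failed", model=model, error=str(e),
            latency_ms=int((time.perf_counter() - t0) * 1000))
        raise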
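
The function above records total latency only. For streaming responses, time to first token can be captured by wrapping the chunk iterator. A rough sketch, assuming the helper name timed_stream and a fake_stream generator standing in for a real streaming response:

```python
import time
from typing import Iterable, Iterator

def timed_stream(chunks: Iterable[str], timings: dict) -> Iterator[str]:
    """Yield chunks unchanged while recording ttft_ms and total_ms in `timings`."""
    t0 = time.perf_counter()
    first = True
    for chunk in chunks:
        if first:
            timings["ttft_ms"] = (time.perf_counter() - t0) * 1000
            first = False
        yield chunk
    timings["total_ms"] = (time.perf_counter() - t0) * 1000

def fake_stream():
    # Stand-in for a streaming LLM response (hypothetical)
    for token in ["Hel", "lo", "!"]:
        time.sleep(0.01)
        yield token

timings = {}
text = "".join(timed_stream(fake_stream(), timings))
```

The timer starts when the consumer pulls the first chunk, so create the wrapper right where the request is sent.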

Trace IDs (correlate)

import uuid
trace_id = str(uuid.uuid4())

# Attach to logs and pass to LLM as request_id
log = log.bind(trace_id=trace_id)
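
To avoid passing a bound logger through every function, the trace ID can live in a context variable. The sketch below uses the standard library's logging and contextvars rather than structlog; trace_id_var, TraceIdFilter and start_trace are illustrative names, not part of any library.

```python
import contextvars
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

def start_trace() -> str:
    # Call once per incoming request
    trace_id = str(uuid.uuid4())
    trace_id_var.set(trace_id)
    return trace_id

logger = logging.getLogger("llm")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

tid = start_trace()
logger.info("retrieval_done")  # both lines carry the same trace ID
logger.info("llm_call")
```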

User feedback

@app.post("/feedback")
def feedback(message_id, rating):
    # DB-API drivers such as sqlite3 expect parameters as a single sequence
    db.execute("UPDATE messages SET rating = ? WHERE id = ?", (rating, message_id))

Store the rating alongside the original request and response → use for fine-tuning / DPO later.
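
A self-contained version of the same idea, using an in-memory SQLite database. The table layout and the record_feedback helper are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE messages (id TEXT PRIMARY KEY, prompt TEXT, response TEXT, rating INTEGER)"
)
db.executemany(
    "INSERT INTO messages (id, prompt, response) VALUES (?, ?, ?)",
    [
        ("m1", "What is RAG?", "Retrieval-augmented generation."),
        ("m2", "What is DPO?", "No idea."),
    ],
)

def record_feedback(message_id: str, rating: int) -> bool:
    """Store +1 / -1 on the message row; return False if the ID is unknown."""
    if rating not in (-1, 1):
        raise ValueError("rating must be -1 or 1")
    cur = db.execute(
        "UPDATE messages SET rating = ? WHERE id = ?", (rating, message_id)
    )
    db.commit()
    return cur.rowcount == 1

record_feedback("m1", 1)
record_feedback("m2", -1)

# Prompt/response pairs users liked: candidate training examples
liked = db.execute(
    "SELECT prompt, response FROM messages WHERE rating = 1"
).fetchall()
```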

Replay

Save inputs; replay against new prompt/model:

for case in production_log[-100:]:
    new_response = llm(case["prompt"], model="claude-opus-4-7")
    judge_compare(case["response"], new_response)
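
The loop above can be wrapped in a small harness that tallies verdicts. A sketch with stubs: candidate and shorter_wins stand in for a real model call and a real LLM judge, and the verdict labels "old" / "new" / "tie" are an assumption.

```python
from typing import Callable

def replay(cases, candidate: Callable[[str], str],
           judge: Callable[[str, str, str], str]) -> dict:
    """Re-run logged prompts through `candidate` and tally judge verdicts."""
    tally = {"old": 0, "new": 0, "tie": 0}
    for case in cases:
        new_response = candidate(case["prompt"])
        verdict = judge(case["prompt"], case["response"], new_response)
        tally[verdict] += 1
    return tally

production_log = [
    {"prompt": "2+2?", "response": "4"},
    {"prompt": "Capital of France?", "response": "Paris, I think"},
]

def candidate(prompt: str) -> str:
    # Stub for the new prompt/model under test
    return {"2+2?": "4", "Capital of France?": "Paris"}[prompt]

def shorter_wins(prompt: str, old: str, new: str) -> str:
    # Toy judge: prefers the shorter answer. Replace with an LLM judge.
    if len(new) == len(old):
        return "tie"
    return "new" if len(new) < len(old) else "old"

result = replay(production_log, candidate, shorter_wins)
```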

A/B in prod

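import hashlib

# Built-in hash() is salted per process for strings, so a user could switch
# variants after a restart. A digest gives a stable assignment.
variant = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 2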
prompt = PROMPT_A if variant == 0 else PROMPT_B
response = llm(prompt, ...)
log.info("ab", variant=variant, ...)

Track outcome metrics per variant.
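
As one example of an outcome metric, the thumbs-up rate per variant can be computed from logged (variant, rating) pairs. The event shape and the function name are assumptions.

```python
from collections import defaultdict

def thumbs_up_rate(events):
    """events: iterable of (variant, rating) pairs, rating is +1 or -1."""
    ups = defaultdict(int)
    totals = defaultdict(int)
    for variant, rating in events:
        totals[variant] += 1
        if rating > 0:
            ups[variant] += 1
    return {v: ups[v] / totals[v] for v in totals}

events = [
    (0, 1), (0, -1), (0, 1), (0, 1),
    (1, 1), (1, -1), (1, -1), (1, -1),
]
rates = thumbs_up_rate(events)
```

With samples this small the difference means little; check significance before switching prompts.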

Cost monitoring

# USD per token (per-1K price / 1000). Illustrative numbers: use your provider's current rates.
COST = {
    "gpt-5": {"in": 0.005 / 1000, "out": 0.015 / 1000},
    "claude-opus-4-7": {"in": 0.015 / 1000, "out": 0.075 / 1000},
}

def estimate_cost(usage, model):
    return usage.prompt_tokens * COST[model]["in"] + usage.completion_tokens * COST[model]["out"]

Aggregate per user, per feature, per day.
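
A minimal aggregation sketch, assuming each logged call is a dict with day, user_id, feature and cost_usd keys (hypothetical field names):

```python
from collections import defaultdict
from datetime import date

def aggregate_cost(calls):
    """Sum cost_usd by (day, user_id, feature)."""
    totals = defaultdict(float)
    for c in calls:
        totals[(c["day"], c["user_id"], c["feature"])] += c["cost_usd"]
    return dict(totals)

calls = [
    {"day": date(2025, 1, 1), "user_id": "u1", "feature": "chat", "cost_usd": 0.25},
    {"day": date(2025, 1, 1), "user_id": "u1", "feature": "chat", "cost_usd": 0.5},
    {"day": date(2025, 1, 1), "user_id": "u2", "feature": "search", "cost_usd": 0.125},
]
totals = aggregate_cost(calls)
```

In production this is usually a GROUP BY in the log store rather than application code.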

Alerts

  • Error rate > X%.
  • Latency p99 > Y seconds.
  • Cost spike > 2x baseline.
  • Failures concentrated on a specific user → check for jailbreak attempts.
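
The first three rules can be checked over a window of recent calls. A sketch only: the thresholds (5% errors, 20 s p99) are placeholder assumptions for X and Y, and the window format is hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def check_alerts(window, baseline_cost, max_error_rate=0.05, max_p99_s=20.0):
    """window: list of dicts with keys ok (bool), latency_s, cost_usd."""
    alerts = []
    error_rate = sum(not c["ok"] for c in window) / len(window)
    if error_rate > max_error_rate:
        alerts.append("error_rate")
    if percentile([c["latency_s"] for c in window], 99) > max_p99_s:
        alerts.append("latency_p99")
    if sum(c["cost_usd"] for c in window) > 2 * baseline_cost:
        alerts.append("cost_spike")
    return alerts

window = [{"ok": True, "latency_s": 1.0, "cost_usd": 0.01}] * 9
window.append({"ok": False, "latency_s": 30.0, "cost_usd": 0.01})
alerts = check_alerts(window, baseline_cost=1.0)
```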

Sampling

For high-volume traffic: log full payloads for a 5-10% sample, but keep counters (requests, tokens, errors) for every call.
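
One way to do this is to hash the request ID, so the decision is reproducible for a given request. The names should_sample, record, counters and full_logs are illustrative.

```python
import hashlib
from collections import Counter

counters = Counter()  # cheap, updated on every call
full_logs = []        # expensive, sampled only

def should_sample(request_id: str, rate: float = 0.1) -> bool:
    """Deterministic: the same request_id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

def record(request_id: str, model: str, prompt: str, response: str) -> None:
    counters[("requests", model)] += 1
    if should_sample(request_id):
        full_logs.append(
            {"request_id": request_id, "prompt": prompt, "response": response}
        )

for i in range(1000):
    record(f"req-{i}", "gpt-5", "prompt", "response")
```

Hashing the trace ID instead keeps every call in a sampled trace, so traces are never logged partially.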

PII redaction in logs

import re

def redact(text):
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    return text

log.info("llm_input", prompt=redact(prompt))
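
The same approach extends to a table of patterns. The email and phone regexes below are rough approximations added for illustration; they will miss many formats, so treat this as a starting point rather than a complete PII filter.

```python
import re

# (pattern, replacement) pairs, applied in order
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),  # US-style only
]

def redact_all(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

sample = "SSN 123-45-6789, mail jane.doe@example.com, call 555-123-4567"
clean = redact_all(sample)
```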

Eval in prod

Sample N% of responses; LLM-judge them; track quality drift.

from random import random

if random() < 0.05:  # judge roughly 5% of responses
    judge_score = judge_llm(prompt, response)
    log.info("eval", score=judge_score)
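
To turn logged scores into a drift signal, compare a rolling mean with a baseline. A sketch assuming a 1-5 judge scale; DriftMonitor, the window size and the tolerance are all illustrative choices.

```python
from collections import deque

class DriftMonitor:
    def __init__(self, baseline: float, window: int = 50, tolerance: float = 0.5):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the latest `window` scores

    def add(self, score: float) -> bool:
        """Record a judge score; return True if the rolling mean has drifted down."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        mean = sum(self.scores) / len(self.scores)
        return self.baseline - mean > self.tolerance

monitor = DriftMonitor(baseline=4.0, window=5, tolerance=0.5)
healthy = [monitor.add(s) for s in [4, 4, 5, 4, 4]]
degraded = [monitor.add(s) for s in [3, 3, 3, 3, 3]]
```

The monitor fires on the third low score, once the rolling mean falls more than 0.5 below the baseline.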

Slow query log

if latency_ms > 10000:
    # Redact before logging: slow-query logs are still logs
    log.warning("slow_llm", prompt=redact(prompt)[:500], latency_ms=latency_ms)

Dashboards

Key panels:

  • Requests/sec.
  • p50/p95/p99 latency.
  • Token throughput.
  • Cost per hour.
  • Error rate.
  • User satisfaction (👍 / 👎).
  • Cache hit rate.

Failure modes to monitor

  • Rate limit hits.
  • Context window overflows.
  • Tool call errors.
  • Validation failures.
  • User reports.
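
Tool call errors and validation failures are easy to count where tool arguments are parsed. A minimal sketch: parse_tool_args, the failures counter and its key names are hypothetical.

```python
import json
from collections import Counter
from typing import Optional

failures = Counter()

def parse_tool_args(raw: str, required: set) -> Optional[dict]:
    """Parse model-produced tool arguments; count each failure by type."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        failures["tool_call_invalid_json"] += 1
        return None
    if not isinstance(args, dict) or not required.issubset(args):
        failures["validation_missing_field"] += 1
        return None
    return args

ok = parse_tool_args('{"city": "Paris"}', {"city"})
truncated = parse_tool_args('{"city": ', {"city"})  # output cut off mid-JSON
incomplete = parse_tool_args("{}", {"city"})        # valid JSON, missing field
```

Export the counter to your metrics system so each failure mode gets its own series.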

Common mistakes

  • No logging → can’t debug.
  • Logging full prompts with PII.
  • No cost dashboard until bill arrives.
  • No feedback loop from users.
  • Treating uptime % as a quality measure (it’s not).

Read this next

If you want my LLM observability stack, it’s at rajpoot.dev.

