LLM observability cheatsheet.
What to track
- Request: model, prompt, params, user.
- Response: text, tool calls, finish reason.
- Latency: time to first token, total.
- Tokens: input, output, cached.
- Cost: derived from tokens.
- Errors: rate limits, timeouts.
- User feedback: 👍/👎.
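One way to keep these fields together is a single record per call, serialized as one JSON line. This is a minimal sketch: the class name `LLMCallRecord` and its field names are assumptions chosen for illustration, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class LLMCallRecord:
    """Hypothetical per-call record covering the fields listed above."""
    # Request
    model: str
    prompt: str
    params: dict
    user_id: str
    # Response
    text: str = ""
    tool_calls: list = field(default_factory=list)
    finish_reason: Optional[str] = None
    # Latency in milliseconds
    ttft_ms: Optional[int] = None
    total_ms: Optional[int] = None
    # Tokens
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    # Derived and outcome fields
    cost_usd: float = 0.0
    error: Optional[str] = None
    feedback: Optional[int] = None  # +1 for thumbs up, -1 for thumbs down

    def to_json(self) -> str:
        """Serialize to one JSON line, keeping non-ASCII text readable."""
        return json.dumps(asdict(self), ensure_ascii=False)


record = LLMCallRecord(
    model="gpt-5",
    prompt="Summarize this ticket.",
    params={"temperature": 0.2},
    user_id="u_123",
    text="The customer cannot log in.",
    finish_reason="stop",
    ttft_ms=180,
    total_ms=950,
    input_tokens=42,
    output_tokens=9,
)
line = record.to_json()
```

Fields that are unknown at request time (feedback, error) default to `None` and can be filled in later.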
Tools
- LangSmith: traces, eval, replays.
- Helicone: proxy, logs, dashboards.
- Phoenix (Arize): open-source observability.
- OpenLLMetry: OpenTelemetry for LLMs.
- Langfuse: open-source.
LangSmith
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=...
Captures all LangChain calls automatically.
Helicone
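Route requests through the Helicone proxy by changing the base URL:
from openai import OpenAI

# HELICONE_KEY is your Helicone API key, loaded elsewhere (e.g. from the environment).
client = OpenAI(
    api_key=...,
    base_url="https://oai.hconeai.com/v1",
    default_headers={"Helicone-Auth": f"Bearer {HELICONE_KEY}"},
)
Logs every request; the only code change is the client constructor.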
OpenTelemetry (OpenLLMetry)
from traceloop.sdk import Traceloop
Traceloop.init(app_name="myapp")
# All openai/anthropic calls auto-traced
Ship spans to Jaeger / Honeycomb / Tempo / DataDog.
Custom logging
import time

import structlog

log = structlog.get_logger()

# Assumes `client` (an OpenAI client) and `current_user_id()` are defined elsewhere.
def chat(messages, model="gpt-5"):
    t0 = time.perf_counter()  # monotonic clock, safe for measuring durations
    try:
        response = client.chat.completions.create(model=model, messages=messages)
        log.info(
            "llm_call",
            model=model,
            in_tokens=response.usage.prompt_tokens,
            out_tokens=response.usage.completion_tokens,
            latency_ms=int((time.perf_counter() - t0) * 1000),
            user_id=current_user_id(),
            feature="chat",
        )
        return response
    except Exception as e:
        # Log the failure, then re-raise so callers can still handle it.
        log.error("llm_call_failed", model=model, error=str(e))
        raise
Trace IDs (correlate)
import uuid
trace_id = str(uuid.uuid4())
# Attach to logs and pass to LLM as request_id
log = log.bind(trace_id=trace_id)
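To avoid passing the bound logger through every function, the trace ID can live in a context variable that the logging layer reads on each record. The sketch below swaps in the standard library's `logging` and `contextvars` instead of structlog so it runs without extra dependencies; `trace_id_var`, `TraceIdFilter`, and `new_trace` are illustrative names.

```python
import contextvars
import logging
import uuid

# Holds the trace ID for the current request (per thread or async task).
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def new_trace() -> str:
    """Start a trace: generate an ID and store it in the context."""
    trace_id = str(uuid.uuid4())
    trace_id_var.set(trace_id)
    return trace_id


logger = logging.getLogger("llm")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

tid = new_trace()
logger.info("retrieval done")  # both lines carry the same trace ID
logger.info("llm_call done")
```

Call `new_trace()` once at the start of each request; every log line emitted while handling it then shares one ID, which makes it possible to join retrieval, LLM, and tool-call logs.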
User feedback
@app.post("/feedback")
def feedback(message_id, rating):
    # Pass the parameters as one tuple, as DB-API drivers such as sqlite3 expect.
    db.execute("UPDATE messages SET rating = ? WHERE id = ?", (rating, message_id))
Stored with original request → use for fine-tuning / DPO later.
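A self-contained sketch of that storage pattern, using an in-memory SQLite database. The `messages` schema and the `record_feedback` helper are assumptions for illustration; a real app would use its own tables and connection handling.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE messages ("
    "id TEXT PRIMARY KEY, prompt TEXT, response TEXT, rating INTEGER)"
)
db.execute(
    "INSERT INTO messages (id, prompt, response) VALUES (?, ?, ?)",
    ("m1", "What is our refund window?", "30 days from delivery."),
)
db.execute(
    "INSERT INTO messages (id, prompt, response) VALUES (?, ?, ?)",
    ("m2", "Reset my password", "I cannot help with that."),
)


def record_feedback(message_id: str, rating: int) -> bool:
    """Store +1 / -1 on the original message; return False if the ID is unknown."""
    if rating not in (1, -1):
        raise ValueError("rating must be 1 or -1")
    cur = db.execute(
        "UPDATE messages SET rating = ? WHERE id = ?", (rating, message_id)
    )
    db.commit()
    return cur.rowcount == 1


record_feedback("m1", 1)
record_feedback("m2", -1)

# Rated rows come back with the original prompt and response attached.
rated = db.execute(
    "SELECT prompt, response, rating FROM messages "
    "WHERE rating IS NOT NULL ORDER BY id"
).fetchall()
```

Because the rating sits on the same row as the prompt and response, exporting training pairs later is a single query.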
Replay
Save inputs; replay against new prompt/model:
for case in production_log[-100:]:
    new_response = llm(case["prompt"], model="claude-opus-4-7")
    judge_compare(case["response"], new_response)
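The loop above can be wrapped in a small harness that tallies verdicts. In this sketch the model and the judge are injected as plain callables; `fake_llm` and `fake_judge` are stand-ins so the code runs offline and should be replaced with real calls.

```python
from typing import Callable


def replay(cases, llm: Callable[[str], str], judge: Callable[[str, str, str], str]):
    """Re-run logged prompts through `llm` and tally the judge's verdicts.

    `judge(prompt, old, new)` must return "old", "new", or "tie".
    """
    tally = {"old": 0, "new": 0, "tie": 0}
    for case in cases:
        new_response = llm(case["prompt"])
        verdict = judge(case["prompt"], case["response"], new_response)
        tally[verdict] += 1
    return tally


# Stand-ins so the sketch runs offline; swap in real model and judge calls.
def fake_llm(prompt: str) -> str:
    return prompt.upper()


def fake_judge(prompt: str, old: str, new: str) -> str:
    if old == new:
        return "tie"
    return "new" if len(new) > len(old) else "old"


production_log = [
    {"prompt": "hi", "response": "HI"},
    {"prompt": "hello", "response": "yo"},
    {"prompt": "ok", "response": "okay then"},
]
result = replay(production_log[-100:], fake_llm, fake_judge)
# result == {"old": 1, "new": 1, "tie": 1}
```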
A/B in prod
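import hashlib

# Built-in hash() is salted per process for strings, so bucket with a stable digest.
variant = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 2
prompt = PROMPT_A if variant == 0 else PROMPT_B
response = llm(prompt, ...)
log.info("ab", variant=variant, ...)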
Track outcome metrics per variant.
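A minimal sketch of both halves: stable assignment and per-variant outcome rates. `assign_variant` and `summarize` are illustrative helpers, and "success" stands for whatever outcome metric you track (thumbs up, task completed, and so on).

```python
import hashlib
from collections import defaultdict


def assign_variant(user_id: str, experiment: str, n_variants: int = 2) -> int:
    """Deterministic bucket: the same user and experiment always map to one variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_variants


def summarize(events):
    """events: iterable of (variant, success) pairs -> success rate per variant."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [successes, count]
    for variant, success in events:
        totals[variant][0] += int(success)
        totals[variant][1] += 1
    return {v: s / n for v, (s, n) in totals.items()}


events = [(0, True), (0, False), (1, True), (1, True)]
rates = summarize(events)
# rates == {0: 0.5, 1: 1.0}
```

Including the experiment name in the hashed key keeps buckets independent across experiments, so a user who lands in variant 0 of one test is not systematically in variant 0 of the next.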
Cost monitoring
# USD per token, written as (price per 1K tokens) / 1000. Check current provider pricing.
COST = {
    "gpt-5": {"in": 0.005 / 1000, "out": 0.015 / 1000},
    "claude-opus-4-7": {"in": 0.015 / 1000, "out": 0.075 / 1000},
}

def estimate_cost(usage, model):
    rates = COST[model]
    return usage.prompt_tokens * rates["in"] + usage.completion_tokens * rates["out"]
Aggregate per user, per feature, per day.
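The aggregation itself can be a dictionary keyed by (user, feature, day). This sketch assumes each logged call is a dict with `ts` (Unix seconds), `user_id`, `feature`, and `cost_usd`; those field names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_costs(calls):
    """Sum cost_usd by (user_id, feature, UTC day)."""
    totals = defaultdict(float)
    for call in calls:
        day = datetime.fromtimestamp(call["ts"], tz=timezone.utc).date().isoformat()
        totals[(call["user_id"], call["feature"], day)] += call["cost_usd"]
    return dict(totals)


calls = [
    {"ts": 0, "user_id": "u1", "feature": "chat", "cost_usd": 0.25},
    {"ts": 3600, "user_id": "u1", "feature": "chat", "cost_usd": 0.5},
    {"ts": 86400, "user_id": "u1", "feature": "chat", "cost_usd": 1.0},
    {"ts": 0, "user_id": "u2", "feature": "search", "cost_usd": 0.125},
]
daily = aggregate_costs(calls)
# daily[("u1", "chat", "1970-01-01")] == 0.75
```

Bucketing by UTC day avoids double counting around local midnight when servers run in different time zones.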
Alerts
- Error rate > X%.
- Latency p99 > Y seconds.
- Cost spike > 2x baseline.
- Failures concentrated on a specific user → check for jailbreak attempts.
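The first three rules can be evaluated over a sliding window of recent calls. This is a sketch with assumed thresholds (5% errors, 10 s p99) and assumed field names (`ok`, `latency_s`, `cost_usd`); tune both to your traffic.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(window, baseline_cost, max_error_rate=0.05, max_p99_s=10.0):
    """window: non-empty list of dicts with 'ok', 'latency_s', 'cost_usd'."""
    alerts = []
    error_rate = sum(1 for c in window if not c["ok"]) / len(window)
    if error_rate > max_error_rate:
        alerts.append("error_rate")
    if percentile([c["latency_s"] for c in window], 99) > max_p99_s:
        alerts.append("latency_p99")
    if sum(c["cost_usd"] for c in window) > 2 * baseline_cost:
        alerts.append("cost_spike")
    return alerts


# A bad window: 10% failures, slow tail, and about 5x the baseline cost.
bad = [{"ok": True, "latency_s": 1.0, "cost_usd": 0.05} for _ in range(90)]
bad += [{"ok": False, "latency_s": 15.0, "cost_usd": 0.05} for _ in range(10)]
fired = check_alerts(bad, baseline_cost=1.0)
# fired == ["error_rate", "latency_p99", "cost_spike"]
```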
Sampling
For high-volume traffic: log full payloads for a 5-10% sample, and keep counters (requests, tokens, errors) for every request.
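A sketch of that split: counters are updated on every request, while full payloads are stored only when a hash of the request ID falls inside the sample. The helper names and the in-memory `full_logs` list are placeholders for your real metrics and log sinks.

```python
import hashlib
from collections import Counter

counters = Counter()  # cheap aggregates, updated for every request
full_logs = []        # full payloads, kept only for the sample


def in_sample(request_id: str, rate: float) -> bool:
    """Hash-based sampling: the same request ID always gets the same decision."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < rate * 10_000


def record(request_id, prompt, response, in_tokens, out_tokens, rate=0.05):
    counters["requests"] += 1
    counters["in_tokens"] += in_tokens
    counters["out_tokens"] += out_tokens
    if in_sample(request_id, rate):
        full_logs.append({"id": request_id, "prompt": prompt, "response": response})


for i in range(1000):
    record(f"req-{i}", "prompt text", "response text", in_tokens=10, out_tokens=5)
```

Hashing the request ID rather than calling a random number generator means every service that sees the same ID makes the same keep-or-drop decision, so sampled traces stay complete across components.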
PII redaction in logs
import re

def redact(text):
    # Replace US Social Security numbers (NNN-NN-NNNN) with a placeholder.
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    return text

log.info("llm_input", prompt=redact(prompt))
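The same approach extends to a table of patterns applied in order. The email pattern below is an assumption added for illustration and is deliberately simple; regexes miss plenty of PII, so treat this as a first filter and not a guarantee.

```python
import re

# Hypothetical pattern table; extend it for the PII your product actually handles.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
]


def redact_all(text: str) -> str:
    """Apply every pattern in order and return the redacted text."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "SSN 123-45-6789, mail jane.doe@example.com."
clean = redact_all(sample)
# clean == "SSN [SSN], mail [EMAIL]."
```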
Eval in prod
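Sample N% of responses, score them with an LLM judge, and track the scores over time to catch quality drift.
import random

if random.random() < 0.05:  # judge roughly 5% of responses
    judge_score = judge_llm(prompt, response)
    log.info("eval", score=judge_score)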
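To turn logged judge scores into a drift signal, one option is to compare a rolling mean against a fixed baseline. The `DriftMonitor` class, its window size, and its tolerance are assumptions for this sketch; scores are taken to be in the range 0 to 1.

```python
from collections import deque


class DriftMonitor:
    """Compare the rolling mean of judge scores against a fixed baseline."""

    def __init__(self, baseline: float, window: int = 200, tolerance: float = 0.1):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> bool:
        """Record a score; return True when a full window has drifted low."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.8, window=4, tolerance=0.1)
flags = [monitor.add(s) for s in [0.9, 0.8, 0.8, 0.9, 0.5, 0.5, 0.5]]
# flags == [False, False, False, False, False, True, True]
```

Waiting for a full window before alerting trades detection speed for fewer false alarms from a handful of low scores.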
Slow query log
if latency_ms > 10000:
    log.warning("slow_llm", prompt=prompt[:500], latency_ms=latency_ms)
Dashboards
Key panels:
- Requests/sec.
- p50/p95/p99 latency.
- Token throughput.
- Cost per hour.
- Error rate.
- User satisfaction (👍 / 👎).
- Cache hit rate.
Failure modes to monitor
- Rate limit hits.
- Context window overflows.
- Tool call errors.
- Validation failures.
- User reports.
Common mistakes
- No logging → can’t debug.
- Logging full prompts with PII.
- No cost dashboard until bill arrives.
- No feedback loop from users.
- Treating uptime % as a quality measure (it’s not).
Read this next
If you want my LLM observability stack, it’s at rajpoot.dev.