LLM apps fail differently from normal services. Outputs degrade silently; cost blows out; a prompt change ships and quality drops 10%. Without observability, you find out from a Twitter screenshot. This post is the working guide to LLM observability in 2026.
What’s worth tracking
Three layers:
1. Per-call data
For every LLM API call:
- Inputs: model, prompt (or hash), tools, parameters.
- Outputs: response text/tools, finish reason.
- Tokens: input, output, cached, reasoning.
- Cost: dollars (calculated from tokens × model price).
- Latency: total, time-to-first-token, time-to-completion.
- Status: success / failure / rate-limited.
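A per-call record can be as small as one dataclass. The sketch below is one possible shape, not a prescribed schema: the model names and prices in `PRICE_PER_MTOK` are hypothetical placeholders, and cached and reasoning tokens are left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical USD prices per million tokens; substitute your provider's current rates.
PRICE_PER_MTOK = {
    "example-small": {"input": 1.00, "output": 5.00},
    "example-large": {"input": 3.00, "output": 15.00},
}


@dataclass
class LLMCall:
    feature: str
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    status: str  # "success", "failure", or "rate_limited"

    @property
    def cost_usd(self) -> float:
        # Cost = tokens x model price, with prices quoted per million tokens.
        price = PRICE_PER_MTOK[self.model]
        return (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        ) / 1_000_000


call = LLMCall("triage", "u_1", "example-small", 1240, 380, 812.0, "success")
print(f"${call.cost_usd:.6f}")
```

Computing cost at write time, rather than at query time, means a later price change does not rewrite history.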
2. Per-feature aggregates
For each user-facing feature (triage, summarize, chat):
- Call rate.
- p50 / p95 / p99 latency.
- Error rate.
- Cost per call.
- Eval score on a held-out set (run nightly).
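As a rough illustration of the rollup, here is a stdlib-only sketch that turns one feature's calls into those aggregates. The function name and input shape are assumptions for this example; in production the same numbers usually come from your metrics backend.

```python
from statistics import quantiles


def feature_stats(latencies_ms, errors, costs_usd):
    """Summarize one feature's calls: latency percentiles, error rate, cost per call.

    Needs at least two calls, because quantiles() requires two data points.
    """
    # quantiles(n=100) returns 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    n = len(latencies_ms)
    return {
        "calls": n,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "error_rate": sum(errors) / n,  # errors is a list of booleans
        "cost_per_call": sum(costs_usd) / n,
    }


stats = feature_stats(
    latencies_ms=list(range(1, 101)),
    errors=[False] * 98 + [True] * 2,
    costs_usd=[0.01] * 100,
)
```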
3. Per-user / per-tenant
- Rate of usage.
- Cost attribution.
- Anomalies (spike from one user → likely bug or abuse).
Without these, “feature X is suddenly $5k/day” or “p95 latency tripled this morning” are invisible until users complain.
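For the per-user layer, a simple first check is whether any single user accounts for an outsized share of spend. This is a minimal sketch; the 50% threshold and the `(user_id, cost_usd)` input shape are assumptions to tune for your traffic.

```python
from collections import defaultdict


def top_spenders(calls, share_threshold=0.5):
    """Return (user_id, share) pairs for users whose share of total spend exceeds the threshold."""
    spend = defaultdict(float)
    for user_id, cost_usd in calls:
        spend[user_id] += cost_usd
    total = sum(spend.values())
    if total == 0:
        return []
    flagged = [(u, c / total) for u, c in spend.items() if c / total > share_threshold]
    # Largest share first, so the likeliest culprit is at the top.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


flagged = top_spenders([("a", 1.0), ("b", 1.0), ("c", 8.0)])
```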
The tooling landscape
| Tool | Type | Strengths | Best for |
|---|---|---|---|
| LangSmith | Closed SaaS | Best UI, prompt mgmt, deep LangChain integration | LangChain shops |
| Langfuse | Open source | Self-host, framework-agnostic | Privacy-critical, polyglot |
| Helicone | Closed SaaS | Cheapest tier, OpenAI-compatible proxy | Quick start |
| Arize Phoenix | Open source | Strong evals + traces | Eval-driven teams |
| Braintrust | Closed SaaS | Eval-first | Companies that ship evals like code |
| OTel + Tempo/Datadog/Honeycomb | Open standard | Generic; integrates with rest of stack | Teams with existing OTel |
For a typical 2026 stack:
- OTel GenAI for traces (lives in your existing observability stack).
- Langfuse or LangSmith for prompt-level dashboards, replay, evals.
The two layers complement each other. OTel for “the API call shape”; LLM tools for “what was the prompt and how did it score?”
OpenTelemetry GenAI conventions
The OTel community standardized GenAI semantic conventions in 2025. Spans now have:
gen_ai.system = "anthropic"
gen_ai.request.model = "claude-sonnet-4-6"
gen_ai.usage.input_tokens = 1240
gen_ai.usage.output_tokens = 380
gen_ai.response.model = "claude-sonnet-4-6"
gen_ai.response.finish_reasons = ["end_turn"]
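Plus events for individual messages. The SDKs do not emit these spans on their own; instrumentation packages for the major ones (anthropic, openai, langchain) do, once OTel is configured.
# Provided by the opentelemetry-instrumentation-anthropic package
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
# Patch the Anthropic client so each call emits a span
AnthropicInstrumentor().instrument()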
Now every Anthropic call is a span carrying the model and token counts (enough to derive cost) in your existing tracing backend. See OpenTelemetry End-to-End in 2026.
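If you call a provider through a hand-rolled client that no instrumentation package covers, you can set the same attributes yourself. The helper below is a hypothetical sketch that only builds the attribute dict. It uses the plural `gen_ai.response.finish_reasons`, which the published conventions define as a list.

```python
def genai_span_attributes(provider, request_model, response_model,
                          input_tokens, output_tokens, finish_reasons):
    """Build a flat attribute dict using GenAI semantic-convention names."""
    return {
        "gen_ai.system": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # The conventions define finish reasons as a list, one entry per choice.
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


attrs = genai_span_attributes(
    "anthropic", "claude-sonnet-4-6", "claude-sonnet-4-6", 1240, 380, ["end_turn"]
)
```

With the OpenTelemetry API installed, a dict like this can be passed to `span.set_attributes(attrs)` inside your own span.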
LangSmith / Langfuse traces
What these tools add over plain OTel:
- Hierarchical traces of multi-step chains and agents.
- Prompt versioning — see which prompt produced which output.
- Eval-on-traces — score historical traces with a new evaluator.
- Side-by-side compare — old prompt vs new prompt on the same input.
- Search and filter — “find me all errors mentioning ‘invoice’ in the last week.”
- Replay — re-run a problematic trace with a different prompt or model.
These are workflows that don’t fit OTel semantics well. You want both.
Wiring Langfuse
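from anthropic import AsyncAnthropic
from langfuse import observe  # SDK v3; on v2 use: from langfuse.decorators import observe

# Reads ANTHROPIC_API_KEY; Langfuse reads LANGFUSE_PUBLIC_KEY etc. from the environment
client = AsyncAnthropic()

@observe()  # decorator wraps the function and records it as a trace
async def triage_ticket(ticket: str) -> str:
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,  # required by the Messages API
        messages=[{"role": "user", "content": ticket}],
    )
    return response.content[0].text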
Every call shows up in Langfuse with input, output, tokens, latency. Add metadata={"feature": "triage", "user_id": uid} for filtering.
For agents, Langfuse traces nested calls automatically — you see the multi-step flow.
Cost dashboards
The metric leadership cares about most:
-- A view stays current as llm_calls grows; CREATE TABLE ... AS would freeze a one-time snapshot.
CREATE VIEW llm_spend AS
SELECT
    date_trunc('day', ts) AS day,
    feature,
    model,
    SUM(input_tokens)  AS in_toks,
    SUM(output_tokens) AS out_toks,
    SUM(cost_usd)      AS cost
FROM llm_calls
GROUP BY 1, 2, 3;
Plot in Grafana. Alert on per-feature cost > daily threshold.
For cost optimization tactics see LLM Cost Optimization.
Eval-on-trace
The killer 2026 pattern. You logged 1000 production calls last week. Run them through an LLM-judge eval on the trace data, after the fact:
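# last_week_traces and judge_llm are placeholders for your trace store and your judge wrapper
for trace in last_week_traces:
    score = judge_llm.score(trace.prompt, trace.response, criterion="helpful")
    # langfuse.score is the v2 SDK name; v3 calls it create_score
    langfuse.score(trace_id=trace.id, name="helpful", value=score)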
Now you have eval scores on real traffic. Notice quality dipped after Tuesday’s deploy? You can prove it.
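Once scores are attached to traces, checking a deploy is a before/after comparison. The sketch below assumes you can export `(timestamp, score)` pairs from your tracing tool; the function name and dates are illustrative. A difference in means is a signal rather than proof, so small samples deserve a significance test before anyone is blamed.

```python
from datetime import datetime
from statistics import mean


def score_shift(scored_traces, deploy_time):
    """Compare mean eval score before and after a deploy.

    scored_traces: iterable of (timestamp, score) pairs.
    Returns (mean_before, mean_after, delta), or None if either side is empty.
    """
    traces = list(scored_traces)  # materialize so we can scan it twice
    before = [s for ts, s in traces if ts < deploy_time]
    after = [s for ts, s in traces if ts >= deploy_time]
    if not before or not after:
        return None
    m_before, m_after = mean(before), mean(after)
    return m_before, m_after, m_after - m_before


deploy = datetime(2026, 1, 6)
shift = score_shift(
    [
        (datetime(2026, 1, 5, 9), 0.9),
        (datetime(2026, 1, 5, 15), 0.8),
        (datetime(2026, 1, 7, 9), 0.6),
        (datetime(2026, 1, 8, 9), 0.7),
    ],
    deploy,
)
```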
For eval mechanics see LLM Evaluations.
Privacy
LLM traces include user prompts. Plan:
- PII redaction before logging (regex + LLM-based).
- Per-tenant data isolation if you’re multi-tenant.
- Self-hosted (Langfuse, Phoenix) if data can’t leave VPC.
- Retention limits matched to your compliance posture.
Don’t ship until this is figured out. The cost of a privacy breach > the cost of any observability tool.
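The regex half of redaction can start very small. This is a minimal sketch with two illustrative patterns: the email regex is deliberately loose, and the key pattern is a hypothetical rule (a short prefix followed by 16 or more key characters), not any provider's real key format. Treat it as a first filter in front of an LLM-based pass, not a complete solution.

```python
import re

# Illustrative patterns only; real PII and secret formats vary widely.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical rule: a known-looking prefix, a separator, then 16+ key characters.
KEY_LIKE = re.compile(r"\b(?:sk|pk|api)[-_][A-Za-z0-9_-]{16,}\b")


def redact(text: str) -> str:
    """Replace emails and key-shaped strings with placeholders before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return KEY_LIKE.sub("[SECRET]", text)


clean = redact("mail bob@example.com key sk-abcdefghijklmnop1234")
```

Run the redaction before the text reaches any logger or tracing SDK, so the raw value never lands on disk.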
Common mistakes
1. Sampling too aggressively
LLM traces are valuable. Sample at 100% for the first months; reduce only when storage costs hurt.
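When the day comes to reduce sampling, a deterministic sampler keyed on the trace ID keeps whole traces together instead of dropping random spans. A sketch, with `keep_trace` as a hypothetical helper that defaults to keeping everything:

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float = 1.0) -> bool:
    """Deterministically decide whether to keep a trace at the given rate (0.0 to 1.0)."""
    if sample_rate >= 1.0:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


kept = sum(keep_trace(str(i), 0.25) for i in range(10_000))
```

Because the decision depends only on the ID, every service that sees the same trace makes the same choice.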
2. Logging prompts verbatim with secrets
If users sometimes paste API keys into prompts (it happens), you’ve now logged those keys. Filter or hash.
3. No alert on cost
Cost dashboards without alerts are dashboards nobody looks at. Alert when daily cost > 1.5× rolling average.
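The 1.5× rule fits in a few lines. This sketch assumes a list of daily totals, oldest first, with today's figure last; the 7-day window is an assumption, and the function name is illustrative.

```python
from statistics import mean


def cost_alert(daily_costs, window=7, factor=1.5):
    """Return True if the latest day's cost exceeds factor x the mean of the prior `window` days."""
    if len(daily_costs) < window + 1:
        return False  # not enough history to form a baseline
    *history, today = daily_costs
    baseline = mean(history[-window:])
    return today > factor * baseline


fires = cost_alert([100, 100, 100, 100, 100, 100, 100, 160])
```

Excluding today from the baseline matters: otherwise a large spike raises its own threshold.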
4. No alert on latency
LLM latency varies. A regression from 800ms to 1500ms p95 is a real problem. Alert.
5. Treating LLM observability as the only observability
Your LLM app has a database, queue, web tier. They all need observability. LLM tools complement that stack; they don't replace it.
What I’d build day one
For a new LLM app:
- OpenTelemetry GenAI auto-instrumentation.
- Langfuse (self-hosted or cloud) for prompt-level visibility.
- Postgres llm_calls table for cost rollups.
- Grafana dashboard: per-feature cost, latency, error rate.
- Nightly eval CI with a 30-row eval set.
Hours of work. Pays back the first time you ship a regression.
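The nightly eval can start as a single script. The sketch below uses exact-match scoring, which suits classification-style features like triage; `fake_triage` is a stand-in for the real model call, and the three rows and the 0.9 gate are placeholders for your own 30-row set and threshold.

```python
def run_eval(predict, eval_set, min_pass_rate=0.9):
    """Run predict over (input, expected) rows; return (pass_rate, passed_gate)."""
    passed = sum(1 for text, expected in eval_set if predict(text) == expected)
    pass_rate = passed / len(eval_set)
    return pass_rate, pass_rate >= min_pass_rate


# Stand-in for a real model call, used only to make the sketch runnable.
def fake_triage(ticket: str) -> str:
    return "billing" if "invoice" in ticket.lower() else "other"


EVAL_SET = [
    ("Invoice total is wrong", "billing"),
    ("App crashes on login", "other"),
    ("Need a copy of my invoice", "billing"),
]

rate, ok = run_eval(fake_triage, EVAL_SET)
```

In CI, exit non-zero when `ok` is false so the nightly job goes red. For free-text outputs, swap the equality check for a judge score.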
Read this next
If you want a Docker Compose with Langfuse + Postgres + Grafana wired into a sample LLM app, it’s at rajpoot.dev.