LLM apps fail differently from normal services. Outputs degrade silently; cost blows out; a prompt change ships and quality drops 10%. Without observability, you find out from a Twitter screenshot. This post is the working guide to LLM observability in 2026.

What’s worth tracking

Three layers:

1. Per-call data

For every LLM API call:

  • Inputs: model, prompt (or hash), tools, parameters.
  • Outputs: response text/tools, finish reason.
  • Tokens: input, output, cached, reasoning.
  • Cost: dollars (calculated from tokens × model price).
  • Latency: total, time-to-first-token, time-to-completion.
  • Status: success / failure / rate-limited.
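A per-call record can be as small as one dataclass. The sketch below is illustrative only: the `LLMCallRecord` name, its fields, the `example-small` model, and the prices are all assumptions, not any provider's real price sheet. Cached and reasoning tokens are left out for brevity; they would need their own fields and prices.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's price sheet.
PRICES_PER_MTOK = {
    "example-small": {"input": 1.00, "output": 5.00},
}


def hash_prompt(prompt: str) -> str:
    """Short, stable fingerprint so the raw prompt need not be stored."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


@dataclass
class LLMCallRecord:
    model: str
    prompt_hash: str
    finish_reason: str
    input_tokens: int
    output_tokens: int
    latency_ms: float   # total time to completion
    ttft_ms: float      # time to first token
    status: str         # "success", "failure", or "rate_limited"

    @property
    def cost_usd(self) -> float:
        """Cost = tokens x per-token price, summed over input and output."""
        price = PRICES_PER_MTOK[self.model]
        return (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        ) / 1_000_000


record = LLMCallRecord(
    model="example-small",
    prompt_hash=hash_prompt("Summarize this ticket"),
    finish_reason="end_turn",
    input_tokens=1240,
    output_tokens=380,
    latency_ms=910.0,
    ttft_ms=240.0,
    status="success",
)
print(f"{record.cost_usd:.5f}")
```

Computing cost at write time, rather than in the dashboard, means a later price change doesn't rewrite history.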

2. Per-feature aggregates

For each user-facing feature (triage, summarize, chat):

  • Call rate.
  • p50 / p95 / p99 latency.
  • Error rate.
  • Cost per call.
  • Eval score on a held-out set (run nightly).
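Most of these aggregates fall straight out of the per-call rows. Here is a minimal sketch, assuming each call is a dict with `feature`, `latency_ms`, `status`, and `cost_usd` keys (names chosen for illustration) and using nearest-rank percentiles; a real metrics backend would compute these for you.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate_by_feature(calls):
    """Group call rows by feature and compute latency, error, and cost aggregates."""
    grouped = defaultdict(list)
    for call in calls:
        grouped[call["feature"]].append(call)

    summary = {}
    for feature, rows in grouped.items():
        latencies = [row["latency_ms"] for row in rows]
        summary[feature] = {
            "calls": len(rows),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "p99_ms": percentile(latencies, 99),
            "error_rate": sum(row["status"] != "success" for row in rows) / len(rows),
            "cost_per_call": sum(row["cost_usd"] for row in rows) / len(rows),
        }
    return summary


# Ten synthetic triage calls: latencies 100..1000 ms, the slowest one failed.
calls = [
    {
        "feature": "triage",
        "latency_ms": 100.0 * i,
        "status": "failure" if i == 10 else "success",
        "cost_usd": 0.002,
    }
    for i in range(1, 11)
]
summary = aggregate_by_feature(calls)
```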

3. Per-user / per-tenant

  • Rate of usage.
  • Cost attribution.
  • Anomalies (spike from one user → likely bug or abuse).
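The anomaly check doesn't need to be clever to be useful. A rough sketch, assuming you already have per-user cost totals for today and a per-user daily baseline; the function name, the 5x factor, and the $1 floor are arbitrary illustrative choices:

```python
def find_spiking_users(today_costs, baseline_costs, factor=5.0, min_cost_usd=1.0):
    """Return users whose spend today is far above their own baseline.

    today_costs / baseline_costs: dicts of user_id -> USD.
    Users with no baseline are flagged as soon as they cross min_cost_usd.
    """
    flagged = []
    for user, cost in today_costs.items():
        baseline = baseline_costs.get(user, 0.0)
        if cost >= min_cost_usd and cost > factor * baseline:
            flagged.append(user)
    return sorted(flagged)


today = {"u1": 0.40, "u2": 62.0, "u3": 3.0}
baseline = {"u1": 0.35, "u2": 2.0, "u3": 2.5}
spiking = find_spiking_users(today, baseline)
```

The floor keeps tiny accounts from paging you when they go from one cent to ten.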

Without these, “feature X is suddenly $5k/day” or “p95 latency tripled this morning” are invisible until users complain.

The tooling landscape

| Tool | Type | Strengths | Best for |
|---|---|---|---|
| LangSmith | Closed SaaS | Best UI, prompt mgmt, deep LangChain integration | LangChain shops |
| Langfuse | Open source | Self-host, framework-agnostic | Privacy-critical, polyglot |
| Helicone | Closed SaaS | Cheapest tier, OpenAI-compatible proxy | Quick start |
| Arize Phoenix | Open source | Strong evals + traces | Eval-driven teams |
| Braintrust | Closed SaaS | Eval-first | Companies that ship evals like code |
| OTel + Tempo/Datadog/Honeycomb | Open standard | Generic; integrates with rest of stack | Teams with existing OTel |

For a typical 2026 stack:

  • OTel GenAI for traces (lives in your existing observability stack).
  • Langfuse or LangSmith for prompt-level dashboards, replay, evals.

The two layers complement each other. OTel for “the API call shape”; LLM tools for “what was the prompt and how did it score?”

OpenTelemetry GenAI conventions

The OTel community standardized GenAI semantic conventions in 2025. Spans now have:

gen_ai.system                  = "anthropic"
gen_ai.request.model           = "claude-sonnet-4-6"
gen_ai.usage.input_tokens      = 1240
gen_ai.usage.output_tokens     = 380
gen_ai.response.model          = "claude-sonnet-4-6"
gen_ai.response.finish_reasons = ["end_turn"]

Plus events for individual messages. Instrumentation packages exist for most major SDKs (anthropic, openai, langchain); once one is installed and enabled, calls are traced without further code changes.

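# Requires the opentelemetry-instrumentation-anthropic package (from the OpenLLMetry project).
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor

AnthropicInstrumentor().instrument()  # call once at startup, after tracing is configured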

Now every Anthropic call is a span carrying the model and token counts (enough to derive cost) in your existing tracing backend. See OpenTelemetry End-to-End in 2026.
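For an SDK without an instrumentor, you can set the same attributes by hand. The helper below is a hypothetical sketch that only builds the attribute dict; note that the semantic conventions spell the finish-reason key in the plural, `gen_ai.response.finish_reasons`, and store a list. Attribute names in this area have changed between releases, so check the version you target.

```python
def genai_span_attributes(system, request_model, response_model,
                          input_tokens, output_tokens, finish_reasons):
    """Build a dict of GenAI span attributes for one completed call."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # One entry per returned choice, hence a list.
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


attrs = genai_span_attributes(
    system="anthropic",
    request_model="claude-sonnet-4-6",
    response_model="claude-sonnet-4-6",
    input_tokens=1240,
    output_tokens=380,
    finish_reasons=["end_turn"],
)
# With the OpenTelemetry API installed, attach them to the active span:
#   span.set_attributes(attrs)
```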

LangSmith / Langfuse traces

What these tools add over plain OTel:

  • Hierarchical traces of multi-step chains and agents.
  • Prompt versioning — see which prompt produced which output.
  • Eval-on-traces — score historical traces with a new evaluator.
  • Side-by-side compare — old prompt vs new prompt on the same input.
  • Search and filter — “find me all errors mentioning ‘invoice’ in the last week.”
  • Replay — re-run a problematic trace with a different prompt or model.

These are workflows that don’t fit OTel semantics well. You want both.

Wiring Langfuse

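from anthropic import AsyncAnthropic
from langfuse import Langfuse, observe           # SDK v3; in v2: from langfuse.decorators import observe

langfuse = Langfuse()                            # reads LANGFUSE_PUBLIC_KEY etc.
client = AsyncAnthropic()                        # reads ANTHROPIC_API_KEY

@observe()                                       # decorator wraps the function
async def triage_ticket(ticket: str) -> str:
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,                         # required by the Messages API
        messages=[{"role": "user", "content": ticket}],
    )
    return response.content[0].text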

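Every call shows up in Langfuse with input, output, and latency; token counts are attached when the model call itself is captured as a generation (for example through an instrumented client). Attach metadata such as {"feature": "triage", "user_id": uid} to the trace for filtering.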

For agents, Langfuse traces nested calls automatically — you see the multi-step flow.

Cost dashboards

The metric leadership cares about most:

-- One-off snapshot: re-run on a schedule, or use a (materialized) view, to keep it current.
CREATE TABLE llm_spend AS
SELECT
  date_trunc('day', ts) AS day,
  feature,
  model,
  SUM(input_tokens) AS in_toks,
  SUM(output_tokens) AS out_toks,
  SUM(cost_usd) AS cost
FROM llm_calls
GROUP BY 1, 2, 3;

Plot in Grafana. Alert on per-feature cost > daily threshold.
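The threshold alert is the same rollup with a filter on top. Here is a sketch using the standard-library `sqlite3` module so it runs anywhere; the budgets, the sample rows, and the `features_over_threshold` helper are made up, and on Postgres you would keep `date_trunc` instead of SQLite's `date()`.

```python
import sqlite3

# Hypothetical per-feature daily budgets in USD.
DAILY_THRESHOLD_USD = {"triage": 5.0, "summarize": 20.0}


def features_over_threshold(conn, day):
    """Return {feature: cost} for features whose spend on `day` exceeds its budget."""
    rows = conn.execute(
        "SELECT feature, SUM(cost_usd) FROM llm_calls "
        "WHERE date(ts) = ? GROUP BY feature",
        (day,),
    ).fetchall()
    # Features without a configured budget never alert.
    return {
        feature: cost
        for feature, cost in rows
        if cost > DAILY_THRESHOLD_USD.get(feature, float("inf"))
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_calls (ts TEXT, feature TEXT, model TEXT, "
    "input_tokens INTEGER, output_tokens INTEGER, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2026-01-05T09:00:00", "triage", "example-small", 1200, 300, 3.0),
        ("2026-01-05T17:30:00", "triage", "example-small", 1100, 280, 2.5),
        ("2026-01-05T11:00:00", "summarize", "example-small", 9000, 900, 12.0),
        ("2026-01-06T08:00:00", "triage", "example-small", 1000, 250, 1.0),
    ],
)
over_budget = features_over_threshold(conn, "2026-01-05")
```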

For cost optimization tactics see LLM Cost Optimization.

Eval-on-trace

The killer 2026 pattern. You logged 1000 production calls last week. Run them through an LLM-judge eval on the trace data, after the fact:

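# last_week_traces and judge_llm are placeholders for your trace export and LLM-judge wrapper.
for trace in last_week_traces:
    score = judge_llm.score(trace.prompt, trace.response, criterion="helpful")
    # SDK v2 method; SDK v3 names it create_score().
    langfuse.score(trace_id=trace.id, name="helpful", value=score)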

Now you have eval scores on real traffic. Notice quality dipped after Tuesday’s deploy? You can prove it.
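"Proving it" comes down to comparing mean scores on either side of the deploy. A self-contained sketch follows; `stub_judge` is a deliberately trivial stand-in for a real LLM judge, and the trace dicts and deploy time are invented for illustration.

```python
from datetime import datetime
from statistics import mean


def stub_judge(prompt, response):
    """Stand-in for an LLM judge: 1.0 for any non-empty response, else 0.0."""
    return 1.0 if response.strip() else 0.0


def mean_score_by_period(traces, deploy_time, judge):
    """Score every trace, then return (mean before deploy, mean after deploy).

    A side with no traces yields None instead of raising.
    """
    before = [judge(t["prompt"], t["response"]) for t in traces if t["ts"] < deploy_time]
    after = [judge(t["prompt"], t["response"]) for t in traces if t["ts"] >= deploy_time]
    return (
        mean(before) if before else None,
        mean(after) if after else None,
    )


traces = [
    {"ts": datetime(2026, 1, 5, 10), "prompt": "How do I reset my password?",
     "response": "Reset it from the account page."},
    {"ts": datetime(2026, 1, 5, 15), "prompt": "Where is my invoice?",
     "response": "Your invoice is attached."},
    {"ts": datetime(2026, 1, 6, 13), "prompt": "Cancel my plan", "response": ""},
    {"ts": datetime(2026, 1, 7, 9), "prompt": "Refund status?",
     "response": "The refund was issued."},
]
deploy_time = datetime(2026, 1, 6, 12)
score_before, score_after = mean_score_by_period(traces, deploy_time, stub_judge)
```

With real traffic you would want enough traces on each side for the difference to mean something; four rows only shows the mechanics.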

For eval mechanics see LLM Evaluations.

Privacy

LLM traces include user prompts. Plan:

  • PII redaction before logging (regex + LLM-based).
  • Per-tenant data isolation if you’re multi-tenant.
  • Self-hosted (Langfuse, Phoenix) if data can’t leave VPC.
  • Retention limits matched to your compliance posture.
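The regex half of redaction can be sketched in a few lines. The two patterns below are illustrative assumptions, not a complete PII detector: they catch common email and phone shapes and will miss names, addresses, and anything unusual, which is where the LLM-based pass comes in.

```python
import re

# Deliberately simple patterns; tune them for your own data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


redacted = redact_pii("Contact jane.doe@example.com or +1 415-555-0100 today")
```

Run it before the trace leaves your process, so the raw text never reaches the logging backend.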

Don’t ship until this is figured out. The cost of a privacy breach > the cost of any observability tool.

Common mistakes

1. Sampling too aggressively

LLM traces are valuable. Sample at 100% for the first months; reduce only when storage costs hurt.

2. Logging prompts verbatim with secrets

If users sometimes paste API keys into prompts (it happens), you’ve now logged those keys. Filter or hash.
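One way to "filter or hash" is to swap anything key-shaped for a short digest. This is only a sketch: the prefixes in `SECRET_RE` are guesses at common token shapes, and you should extend the pattern with the key formats your users actually paste.

```python
import hashlib
import re

# Assumed key shapes: a short vendor-style prefix, a separator, then 16+ token characters.
SECRET_RE = re.compile(r"\b(?:sk|pk|ghp|xoxb)[-_][A-Za-z0-9_-]{16,}\b")


def scrub_secrets(prompt: str) -> str:
    """Replace key-like substrings with a short SHA-256 digest of the secret."""

    def replace(match):
        digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:8]
        return f"[SECRET:{digest}]"

    return SECRET_RE.sub(replace, prompt)


scrubbed = scrub_secrets("my key is sk-abcdefghijklmnop1234 please debug")
```

The digest is stable, so you can still tell that two traces contained the same key without storing the key itself.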

3. No alert on cost

Cost dashboards without alerts are dashboards nobody looks at. Alert when daily cost > 1.5× rolling average.
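The 1.5× rule is a few lines once you have daily totals. A minimal sketch, assuming a list of daily costs ordered oldest first with today last; the 7-day window is an arbitrary choice here:

```python
from statistics import mean


def cost_alert(daily_costs, window=7, factor=1.5):
    """True when today's cost exceeds factor x the mean of the previous `window` days.

    daily_costs: USD totals, oldest first, with today as the last element.
    Returns False until there is a full window of history.
    """
    if len(daily_costs) < window + 1:
        return False
    *history, today = daily_costs
    baseline = mean(history[-window:])
    return today > factor * baseline


should_alert = cost_alert([100.0] * 7 + [160.0])
```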

4. No alert on latency

LLM latency varies. A regression from 800ms to 1500ms p95 is a real problem. Alert.

5. Treating LLM observability as the only observability

Your LLM app has a database, a queue, and a web tier. They all need observability. LLM tools complement that stack; they don't replace it.

What I’d build day one

For a new LLM app:

  1. OpenTelemetry GenAI auto-instrumentation.
  2. Langfuse (self-hosted or cloud) for prompt-level visibility.
  3. Postgres llm_calls table for cost rollups.
  4. Grafana dashboard: per-feature cost, latency, error rate.
  5. Nightly eval CI with a 30-row eval set.

Hours of work. Pays back the first time you ship a regression.
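The nightly eval gate from item 5 can start this small. Everything here is illustrative: the three-row eval set stands in for the 30 rows, `stub_generate` stands in for the real model call, and substring matching is the crudest possible scorer.

```python
def run_eval(eval_set, generate, min_pass_rate=0.9):
    """Return (pass_rate, ok). A row passes if the expected label appears in the output."""
    passed = sum(
        1
        for row in eval_set
        if row["expected"].lower() in generate(row["input"]).lower()
    )
    rate = passed / len(eval_set)
    return rate, rate >= min_pass_rate


EVAL_SET = [
    {"input": "Card was charged twice", "expected": "billing"},
    {"input": "Cannot log in", "expected": "auth"},
    {"input": "App crashes on start", "expected": "bug"},
]

# Canned outputs standing in for a real model; the last one is wrong on purpose.
CANNED = {
    "Card was charged twice": "Category: billing",
    "Cannot log in": "Category: auth",
    "App crashes on start": "Category: feature-request",
}


def stub_generate(prompt):
    return CANNED[prompt]


pass_rate, ok = run_eval(EVAL_SET, stub_generate)
exit_code = 0 if ok else 1  # in CI, pass this to sys.exit() so a regression fails the job
```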

Read this next

If you want a Docker Compose with Langfuse + Postgres + Grafana wired into a sample LLM app, it’s at rajpoot.dev.

