Do I need LLM-specific observability or is OpenTelemetry enough?

For most teams, both. OpenTelemetry GenAI conventions give you traces and metrics that flow into your existing stack. LLM-specific tools (LangSmith, Langfuse, Phoenix) add prompt management, evals, and replay. They complement each other.

LangSmith vs Langfuse?

LangSmith is closed-source, polished, tightly integrated with LangChain. Langfuse is open-source, self-hostable, framework-agnostic. For LangChain shops: LangSmith. For privacy-critical or non-LangChain stacks: Langfuse.

What metrics should every LLM app track?

Per-call: input tokens, output tokens, model, latency, status, cost. Per-feature: error rate, p95 latency, eval score on held-out set. Per-user: rate of usage, cost. Without these you can't diagnose regressions.

LLM Observability in 2026 — LangSmith, Langfuse, Helicone, and OpenTelemetry

LLM apps fail differently from normal services. Outputs degrade silently; costs balloon; a prompt change ships and quality drops 10%. Without observability, you find out from a Twitter screenshot. This post is a working guide to LLM observability in 2026.

What’s worth tracking

Three layers:

1. Per-call data

For every LLM API call:

Inputs: model, prompt (or hash), tools, parameters.
Outputs: response text/tools, finish reason.
Tokens: input, output, cached, reasoning.
Cost: dollars (calculated from tokens × model price).
Latency: total, time-to-first-token, time-to-completion.
Status: success / failure / rate-limited.
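As a rough sketch, the per-call fields above can be captured in one record, with cost derived from token counts. The `LLMCall` class, the `example-small` model name, and the prices in `PRICE_PER_MTOK` are all hypothetical placeholders; cached and reasoning tokens are left out to keep the example short.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICE_PER_MTOK = {
    "example-small": {"input": 1.00, "output": 5.00},
}


@dataclass
class LLMCall:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    status: str  # "success", "failure", or "rate_limited"

    @property
    def cost_usd(self) -> float:
        # tokens x per-token price, summed over input and output
        price = PRICE_PER_MTOK[self.model]
        return (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        ) / 1_000_000


call = LLMCall("triage", "example-small", 1240, 380, 812.0, "success")
print(f"{call.cost_usd:.6f}")
```

Computing cost at write time, rather than at query time, keeps historical rows correct when prices change later.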

2. Per-feature aggregates

For each user-facing feature (triage, summarize, chat):

Call rate.
p50 / p95 / p99 latency.
Error rate.
Cost per call.
Eval score on a held-out set (run nightly).
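A minimal sketch of how these aggregates might be computed from raw call records, assuming each record is a `(feature, latency_ms, ok)` tuple. The function names are illustrative, and the percentile uses the simple nearest-rank method; a metrics backend may interpolate differently.

```python
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def feature_stats(calls):
    """calls: iterable of (feature, latency_ms, ok) tuples."""
    by_feature = defaultdict(list)
    for feature, latency_ms, ok in calls:
        by_feature[feature].append((latency_ms, ok))

    stats = {}
    for feature, rows in by_feature.items():
        latencies = sorted(lat for lat, _ in rows)
        errors = sum(1 for _, ok in rows if not ok)
        stats[feature] = {
            "calls": len(rows),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "p99_ms": percentile(latencies, 99),
            "error_rate": errors / len(rows),
        }
    return stats


# Ten synthetic calls: latencies 100..1000 ms, the slowest one failed.
sample = [("triage", float(ms), ms != 1000) for ms in range(100, 1100, 100)]
stats = feature_stats(sample)
```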

3. Per-user / per-tenant

Rate of usage.
Cost attribution.
Anomalies (spike from one user → likely bug or abuse).

Without these, “feature X is suddenly $5k/day” or “p95 latency tripled this morning” are invisible until users complain.
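One simple way to surface a single-user spike is to compare each user's call count against the median user. This is only a sketch: `usage_outliers`, the `factor` of 10, and the `min_calls` floor are assumptions to tune against your own traffic.

```python
from collections import Counter
from statistics import median


def usage_outliers(user_ids, factor=10.0, min_calls=50):
    """Return {user: count} for users far above the median user's call count."""
    counts = Counter(user_ids)
    if not counts:
        return {}
    typical = median(counts.values())
    return {
        user: n
        for user, n in counts.items()
        if n >= min_calls and n > factor * typical
    }


# One event per call; u4 is the runaway client.
events = ["u1"] * 5 + ["u2"] * 7 + ["u3"] * 6 + ["u4"] * 400
flagged = usage_outliers(events)
```

The `min_calls` floor avoids flagging users on a quiet day when the median is close to zero.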

The tooling landscape

Tool	Type	Strengths	Best for
LangSmith	Closed SaaS	Best UI, prompt mgmt, deep LangChain integration	LangChain shops
Langfuse	Open source	Self-host, framework-agnostic	Privacy-critical, polyglot
Helicone	Open source + SaaS	Low-cost tier, OpenAI-compatible proxy	Quick start
Arize Phoenix	Open source	Strong evals + traces	Eval-driven teams
Braintrust	Closed SaaS	Eval-first	Companies that ship evals like code
OTel + Tempo/Datadog/Honeycomb	Open standard	Generic; integrates with rest of stack	Teams with existing OTel

For a typical 2026 stack:

OTel GenAI for traces (lives in your existing observability stack).
Langfuse or LangSmith for prompt-level dashboards, replay, evals.

The two layers complement each other. OTel for “the API call shape”; LLM tools for “what was the prompt and how did it score?”

OpenTelemetry GenAI conventions

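The OTel community has published GenAI semantic conventions. Check their stability status before pinning to exact names, since attributes have been renamed between releases (newer releases use gen_ai.provider.name in place of gen_ai.system). Spans carry attributes such as: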

gen_ai.system          = "anthropic"
gen_ai.request.model   = "claude-sonnet-4-6"
gen_ai.usage.input_tokens = 1240
gen_ai.usage.output_tokens = 380
gen_ai.response.model  = "claude-sonnet-4-6"
gen_ai.response.finish_reasons = ["end_turn"]

Plus events for individual messages. The SDKs don't emit these spans on their own; instrumentation packages for the major ones (anthropic, openai, langchain) add them once installed and enabled alongside your OTel setup.

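# Assumes the opentelemetry-instrumentation-anthropic package and a configured tracer provider.
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
AnthropicInstrumentor().instrument()              # patches the anthropic client to emit spans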

Now every Anthropic call is a span carrying the model and token counts, from which cost can be derived, in your existing tracing backend. See OpenTelemetry End-to-End in 2026.
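Where no instrumentation package exists for a client, the same attributes can be set by hand. The helper below is a hypothetical sketch that only builds the attribute dict. It uses the plural `gen_ai.response.finish_reasons` array form from the published conventions; attribute names may differ in the convention version your backend expects.

```python
def genai_attributes(provider, request_model, response_model,
                     input_tokens, output_tokens, finish_reasons):
    """Return a dict of gen_ai.* span attributes for one model call."""
    return {
        "gen_ai.system": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": int(input_tokens),
        "gen_ai.usage.output_tokens": int(output_tokens),
        # The conventions define finish reasons as an array of strings.
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


attrs = genai_attributes(
    provider="anthropic",
    request_model="claude-sonnet-4-6",
    response_model="claude-sonnet-4-6",
    input_tokens=1240,
    output_tokens=380,
    finish_reasons=["end_turn"],
)
```

With the OpenTelemetry API installed, the dict can be passed to `span.set_attributes(attrs)` inside a `tracer.start_as_current_span(...)` block.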

LangSmith / Langfuse traces

What these tools add over plain OTel:

Hierarchical traces of multi-step chains and agents.
Prompt versioning — see which prompt produced which output.
Eval-on-traces — score historical traces with a new evaluator.
Side-by-side compare — old prompt vs new prompt on the same input.
Search and filter — “find me all errors mentioning ‘invoice’ in the last week.”
Replay — re-run a problematic trace with a different prompt or model.

These are workflows that don’t fit OTel semantics well. You want both.

Wiring Langfuse

from anthropic import AsyncAnthropic
from langfuse import observe                     # langfuse v3; v2 uses langfuse.decorators

client = AsyncAnthropic()                        # reads ANTHROPIC_API_KEY
                                                 # Langfuse reads LANGFUSE_PUBLIC_KEY etc.

@observe()                                       # decorator wraps the function
async def triage_ticket(ticket: str) -> str:
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,                         # required by the Messages API
        messages=[{"role": "user", "content": ticket}],
    )
    return response.content[0].text

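Every call shows up in Langfuse with the function's input, output, and latency; token counts appear when the model call itself is instrumented or usage is reported on the observation. Attach metadata such as {"feature": "triage", "user_id": uid} to the trace for filtering (the exact call depends on the SDK version).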

For agents, nested decorated calls show up as child observations of the same trace, so you see the multi-step flow.

Cost dashboards

The metric leadership cares about most:

-- Postgres; a one-off snapshot. Use a view or a scheduled job to keep it current.
CREATE TABLE llm_spend AS
SELECT
  date_trunc('day', ts) AS day,
  feature,
  model,
  SUM(input_tokens) AS in_toks,
  SUM(output_tokens) AS out_toks,
  SUM(cost_usd) AS cost
FROM llm_calls
GROUP BY 1, 2, 3;

Plot in Grafana. Alert on per-feature cost > daily threshold.
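To make the rollup and the threshold check concrete, here is a self-contained sketch using an in-memory SQLite database, so `date(ts)` stands in for Postgres's `date_trunc('day', ts)`. The rows, model names, and the `DAILY_BUDGET_USD` values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_calls (ts TEXT, feature TEXT, model TEXT, "
    "input_tokens INTEGER, output_tokens INTEGER, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2026-01-05 09:00:00", "triage", "example-small", 1200, 300, 0.25),
        ("2026-01-05 17:30:00", "triage", "example-small", 800, 200, 0.5),
        ("2026-01-05 10:00:00", "chat", "example-large", 4000, 900, 2.0),
    ],
)

# Hypothetical per-feature daily budgets in USD.
DAILY_BUDGET_USD = {"triage": 1.00, "chat": 1.50}

# Daily cost per feature, the same shape as the rollup above minus the model column.
rows = conn.execute(
    "SELECT date(ts) AS day, feature, SUM(cost_usd) AS cost "
    "FROM llm_calls GROUP BY day, feature ORDER BY feature"
).fetchall()

# Features with no configured budget are never flagged.
over_budget = [
    (day, feature, cost)
    for day, feature, cost in rows
    if cost > DAILY_BUDGET_USD.get(feature, float("inf"))
]
```

In production the same comparison would more likely live in a Grafana alert rule; the Python version is just easier to unit test.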

For cost optimization tactics see LLM Cost Optimization.

Eval-on-trace

The killer 2026 pattern. You logged 1000 production calls last week. Run them through an LLM-judge eval on the trace data, after the fact:

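# last_week_traces and judge_llm are placeholders for your trace export and judge wrapper.
for trace in last_week_traces:
    score = judge_llm.score(trace.prompt, trace.response, criterion="helpful")
    # Langfuse SDK v2 method; v3 renames it to create_score().
    langfuse.score(trace_id=trace.id, name="helpful", value=score)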

Now you have eval scores on real traffic. Notice quality dipped after Tuesday’s deploy? You can prove it.
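Once traces carry scores, checking a deploy is a comparison of means on either side of the deploy time. A minimal sketch, assuming scores are exported as `(timestamp, score)` pairs; `score_shift` and the sample data are hypothetical, and a real check should also look at sample sizes before calling a dip significant.

```python
from datetime import datetime
from statistics import mean


def score_shift(scored_traces, deploy_time):
    """Return (mean_before, mean_after, delta) around a deploy timestamp."""
    scored_traces = list(scored_traces)
    before = [s for ts, s in scored_traces if ts < deploy_time]
    after = [s for ts, s in scored_traces if ts >= deploy_time]
    if not before or not after:
        raise ValueError("need scored traces on both sides of the deploy")
    b, a = mean(before), mean(after)
    return b, a, a - b


deploy = datetime(2026, 1, 6, 12, 0)
scores = [
    (datetime(2026, 1, 5, 9, 0), 1.0),
    (datetime(2026, 1, 5, 15, 0), 0.5),
    (datetime(2026, 1, 6, 14, 0), 0.5),
    (datetime(2026, 1, 7, 10, 0), 0.0),
]
before, after, delta = score_shift(scores, deploy)
```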

For eval mechanics see LLM Evaluations.

Privacy

LLM traces include user prompts. Plan:

PII redaction before logging (regex + LLM-based).
Per-tenant data isolation if you’re multi-tenant.
Self-hosted (Langfuse, Phoenix) if data can’t leave VPC.
Retention limits matched to your compliance posture.
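The regex half of the redaction step might look like the sketch below. The two patterns are deliberately simple assumptions: they catch common email and phone shapes, miss many others, and the phone pattern can also match long digit runs such as order numbers. An LLM-based pass would sit behind this as a second filter.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least 7 digits/separators, ending on a digit.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


redacted = redact_pii("Reach me at jane.doe@example.com or +1 (555) 010-9999.")
```

Run the redaction before the text reaches the tracing client, so unredacted prompts never leave the process.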

Don’t ship until this is figured out. The cost of a privacy breach > the cost of any observability tool.

Common mistakes

1. Sampling too aggressively

LLM traces are valuable. Sample at 100% for the first months; reduce only when storage costs hurt.

2. Logging prompts verbatim with secrets

If users sometimes paste API keys into prompts (it happens), you’ve now logged those keys. Filter or hash.
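A sketch of both options, filtering and hashing. The patterns in `SECRET_PATTERNS` are illustrative only, since real key formats vary by provider and change over time; `scrub_secrets` and `prompt_fingerprint` are hypothetical helper names.

```python
import hashlib
import re

# Illustrative patterns only; extend with the key formats your users handle.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),           # "sk-" style API keys
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),              # AWS access key IDs
    re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]{20,}=*"),
]


def scrub_secrets(prompt):
    """Replace anything that looks like a credential before logging."""
    for pattern in SECRET_PATTERNS:
        prompt = pattern.sub("[REDACTED_SECRET]", prompt)
    return prompt


def prompt_fingerprint(prompt):
    """Short stable hash for grouping identical prompts without storing them."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


raw = "Why does sk-abc123def456ghi789jkl return 401?"
safe = scrub_secrets(raw)
fingerprint = prompt_fingerprint(raw)
```

Hashing keeps deduplication and "same prompt as last time?" queries working, but it gives up search and replay, so filtering is usually the better default.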

3. No alert on cost

Cost dashboards without alerts are dashboards nobody looks at. Alert when daily cost > 1.5× rolling average.
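The 1.5× rule is a few lines once daily totals exist. In this sketch the 7-day window and the `cost_alert` name are assumptions; the function returns False until there is enough history to form a baseline.

```python
from statistics import mean


def cost_alert(daily_costs, window=7, factor=1.5):
    """daily_costs: oldest-to-newest daily USD totals; the last item is today."""
    if len(daily_costs) < window + 1:
        return False  # not enough history to form a baseline
    *history, today = daily_costs
    baseline = mean(history[-window:])
    return today > factor * baseline


spike = cost_alert([100, 110, 90, 105, 95, 100, 100, 170])
normal = cost_alert([100, 110, 90, 105, 95, 100, 100, 140])
```

Excluding today from the baseline matters: otherwise a large spike raises its own threshold.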

4. No alert on latency

LLM latency varies. A regression from 800ms to 1500ms p95 is a real problem. Alert.

5. Treating LLM observability as the only observability

Your LLM app has a database, queue, web tier. They all need observability. LLM tools complement that stack; they don't replace it.

What I’d build day one

For a new LLM app:

OpenTelemetry GenAI auto-instrumentation.
Langfuse (self-hosted or cloud) for prompt-level visibility.
Postgres llm_calls table for cost rollups.
Grafana dashboard: per-feature cost, latency, error rate.
Nightly eval CI with a 30-row eval set.

Hours of work. Pays back the first time you ship a regression.
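The nightly eval can start as a single script. The sketch below uses stand-in `fake_generate` and `exact_match` functions so it runs without a model, a four-row `EVAL_SET` in place of the 30-row one, and an assumed `MIN_PASS_RATE`; all of these names are hypothetical.

```python
def run_eval(eval_set, generate, grade):
    """Run each row through generate(), grade it, return (pass_rate, failures)."""
    failures = []
    for row in eval_set:
        output = generate(row["input"])
        if not grade(output, row["expected"]):
            failures.append({"input": row["input"], "output": output})
    pass_rate = 1 - len(failures) / len(eval_set)
    return pass_rate, failures


# Stand-ins so the sketch runs offline; swap in a real model call and grader.
def fake_generate(text):
    return "billing" if "invoice" in text.lower() else "other"


def exact_match(output, expected):
    return output == expected


EVAL_SET = [
    {"input": "My invoice is wrong", "expected": "billing"},
    {"input": "App crashes on login", "expected": "other"},
    {"input": "Refund my last charge", "expected": "billing"},
    {"input": "Invoice PDF is blank", "expected": "billing"},
]

MIN_PASS_RATE = 0.9  # assumed threshold; tune per feature

pass_rate, failures = run_eval(EVAL_SET, fake_generate, exact_match)
exit_code = 0 if pass_rate >= MIN_PASS_RATE else 1
```

In CI the script would end with `sys.exit(exit_code)` so that a drop below the threshold fails the job, and it would print `failures` for whoever picks up the alert.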

Read this next

If you want a Docker Compose with Langfuse + Postgres + Grafana wired into a sample LLM app, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
