Is OpenTelemetry enough for LLM apps?

OTEL handles HTTP and DB tracing well, but plain OTEL instrumentation does not capture LLM-specific structure: token counts per call, prompts and responses, tool calls. Use OTEL plus a dedicated LLM observability tool (Langfuse, Arize Phoenix, Helicone).

Do I need a separate eval setup or can I just trace?

Both. Tracing tells you what happened. Evals tell you whether the output was right. Production tracing + offline evals on captured traces is the durable pattern.

LLM Observability in 2026 — Tracing, Evals, and the Things You Can't Skip

LLM apps fail differently from regular services. The HTTP returns 200; the response is wrong. Latency is fine; tokens overspend. No errors; users complain. You can’t debug what you can’t see. This post is the working playbook for visibility.

What to capture

For every LLM call:

Prompt (system + messages, with variable substitutions resolved).
Response (full text or tool call).
Token counts (input cached, input fresh, output).
Latency (TTFB, total).
Model + version.
Cost (computed from tokens × price).
User / session ID.
Trace ID (links to other spans).
Tool calls (each tool invocation, args, result).
Status (success / error / refused).

Save it all. Storage is cheap; debugging without it is impossible.
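One way to hold the fields above is a single record per call. This is a minimal sketch, not a schema from any particular tool: the `LLMCallRecord` class, the `example-model` name, and the `PRICE_PER_MTOK` table are all hypothetical, and the prices are placeholders rather than real rates.

```python
from dataclasses import dataclass, field

# Placeholder prices in USD per million tokens. These are NOT real rates;
# replace them with your provider's price sheet.
PRICE_PER_MTOK = {
    "example-model": {"input_cached": 0.30, "input_fresh": 3.00, "output": 15.00},
}


@dataclass
class LLMCallRecord:
    trace_id: str
    user_id: str
    session_id: str
    model: str
    prompt: list             # resolved system prompt + messages
    response: str
    input_cached_tokens: int
    input_fresh_tokens: int
    output_tokens: int
    ttfb_ms: float
    total_ms: float
    status: str = "success"  # "success" | "error" | "refused"
    tool_calls: list = field(default_factory=list)  # each: name, args, result

    @property
    def cost_usd(self) -> float:
        """Cost = tokens x price, summed over the three token kinds."""
        price = PRICE_PER_MTOK[self.model]
        return (
            self.input_cached_tokens * price["input_cached"]
            + self.input_fresh_tokens * price["input_fresh"]
            + self.output_tokens * price["output"]
        ) / 1_000_000
```

Keeping cached and fresh input tokens as separate fields matters because they are usually billed at different rates; a single "input tokens" number makes the cost column wrong.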

Tools

Tool	Strengths
Langfuse	OSS-friendly, self-hostable, prompt versioning
Arize Phoenix	OSS, evals + tracing, OpenInference
Helicone	Cheapest setup; proxy-based
LangSmith	LangChain-native; mature
Datadog LLM Obs	Existing DD shops; integrates with APM
Hand-rolled	OTEL spans + a Postgres table

For most teams: Langfuse (OSS) or Arize Phoenix (OSS). For enterprise: LangSmith or Datadog.

Langfuse trace

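from anthropic import AsyncAnthropic
from langfuse import Langfuse

lf = Langfuse()  # reads the LANGFUSE_* keys from the environment
client = AsyncAnthropic()

# Langfuse Python SDK v2-style calls; run this inside your async request handler.
trace = lf.trace(name="chat-handler", user_id=user.id, session_id=session_id)
gen = trace.generation(
    name="answer",
    model="claude-sonnet-4-6",
    input=messages,
)
resp = await client.messages.create(...)
gen.end(
    output=resp.content[0].text,
    usage={"input": resp.usage.input_tokens, "output": resp.usage.output_tokens},
)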

Web UI: see every trace, drill into prompts/responses, replay, fork prompts.

OpenInference + OTEL

For OTEL shops:

from openinference.instrumentation.anthropic import AnthropicInstrumentor

AnthropicInstrumentor().instrument()
# Now every Anthropic call emits OTEL spans with LLM-specific attributes

Spans include llm.model_name, llm.token_count.prompt, llm.token_count.completion, etc. With an OTEL exporter configured, send them to Phoenix / Datadog / Tempo.

Eval pipelines

Tracing without evals = beautiful logs of bad answers. Combine.

async def eval_set(traces, judge):
    results = []
    for t in traces:
        score = await judge(t.input, t.output)
        results.append({"trace_id": t.id, "score": score.score, "reason": score.reason})
    return results

Run on a subset of production traces nightly. Score with an LLM judge or rule-based check.
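For the rule-based option, a judge can be a plain async function that returns an object with `score` and `reason`, which is what `eval_set` reads. The sketch below is an assumption about shape, not a library API: `Trace`, `Score`, `rule_judge`, and `sample_traces` are hypothetical names, and the refusal prefixes are only examples.

```python
import asyncio
import random
from dataclasses import dataclass


@dataclass
class Trace:  # hypothetical shape of a captured trace
    id: str
    input: str
    output: str


@dataclass
class Score:
    score: float
    reason: str


async def rule_judge(question: str, answer: str) -> Score:
    """Rule-based check: fail empty answers and canned refusals."""
    text = answer.strip().lower()
    if not text:
        return Score(0.0, "empty answer")
    if text.startswith(("i can't", "i cannot", "i'm sorry")):
        return Score(0.0, "looks like a refusal")
    return Score(1.0, "passed basic checks")


def sample_traces(traces: list, k: int, seed: int = 0) -> list:
    """Pick up to k traces; a fixed seed keeps a nightly run reproducible."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))
```

An LLM judge can keep the same signature, so swapping one for the other does not change the loop that calls it.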

from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_substring: str

async def regress(cases: list[EvalCase], answer_fn):
    fails = []
    for c in cases:
        out = await answer_fn(c.input)
        if c.expected_substring not in out:
            fails.append(c)
    return fails

Run on every PR before deploy. Failures block the deploy. See LLM Evaluation.

Prompt versioning

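from langfuse.decorators import observe  # Langfuse Python SDK v2 import path

PROMPT_V = "answer-v3"

@observe()
def answer(question):
    p = lf.get_prompt(PROMPT_V)  # versioned prompt fetched from registry
    return llm.complete(p.compile(question=question))  # llm: your own model client wrapper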

When you change a prompt: bump version. Old traces still reference the old version. Compare quality across versions.
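Comparing quality across versions can be as simple as grouping eval scores by the prompt version recorded on each trace. A minimal sketch, assuming each row is a dict with hypothetical `prompt_version` and `score` keys exported from your trace store:

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(rows: list) -> dict:
    """Group eval scores by prompt version and average them."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["prompt_version"]].append(row["score"])
    return {version: mean(scores) for version, scores in buckets.items()}


rows = [
    {"prompt_version": "answer-v2", "score": 1.0},
    {"prompt_version": "answer-v2", "score": 0.0},
    {"prompt_version": "answer-v3", "score": 1.0},
    {"prompt_version": "answer-v3", "score": 1.0},
]
print(mean_score_by_version(rows))
```

A mean over a handful of traces is noisy, so treat small differences between versions as inconclusive until the sample is larger.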

Regression detection

from datetime import date, timedelta

async def detect_regression():
    # `traces` is your trace store. Distinct names keep the dates from being shadowed.
    today_date = date.today()
    prev = await traces.where(date=today_date - timedelta(days=1))
    curr = await traces.where(date=today_date)

    if curr.error_rate > prev.error_rate * 1.5:
        alert("LLM error rate spike")
    if curr.avg_latency > prev.avg_latency * 1.5:
        alert("LLM latency spike")
    if curr.refusal_rate > 0.05:
        alert("LLM refusing too often")

Alert on quality regressions, not just outages.
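The rates compared in the checks have to come from somewhere. The sketch below shows one way to aggregate a day's trace records into them, assuming each record is a dict with hypothetical `status` and `latency_ms` keys. The `spiked` helper is an addition of mine, not part of the original checks: it puts a floor under the baseline so that a day with zero errors does not make a single error look like a spike.

```python
from dataclasses import dataclass


@dataclass
class DayStats:
    error_rate: float
    refusal_rate: float
    avg_latency: float


def summarize(records: list) -> DayStats:
    """Aggregate one day's trace records into the rates the alerts compare."""
    n = len(records)
    if n == 0:
        return DayStats(0.0, 0.0, 0.0)
    errors = sum(r["status"] == "error" for r in records)
    refusals = sum(r["status"] == "refused" for r in records)
    total_latency = sum(r["latency_ms"] for r in records)
    return DayStats(errors / n, refusals / n, total_latency / n)


def spiked(today: float, yesterday: float, ratio: float = 1.5, floor: float = 0.01) -> bool:
    """Relative-spike check with a floor so a near-zero baseline does not alert on noise."""
    return today > max(yesterday, floor) * ratio
```

The 1.5 ratio mirrors the thresholds in the checks; the 0.01 floor is an arbitrary starting point to tune against your own traffic.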

User feedback

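from typing import Optional

@app.post("/feedback")
async def feedback(trace_id: str, rating: int, comment: Optional[str] = None):
    # lf.score is a synchronous call in the Langfuse v2 SDK, so it is not awaited.
    lf.score(trace_id=trace_id, name="user_rating", value=rating, comment=comment)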

Build feedback into the UI. Thumbs up / down per response. Feeds back into eval set.
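Once ratings are stored, two small helpers cover most of the value: a thumbs-down rate to alert on, and a queue of disliked traces for a human to turn into eval cases. This is a sketch under assumptions: ratings are encoded as 0 (thumbs down) and 1 (thumbs up), feedback items are dicts with hypothetical `trace_id` and `rating` keys, and the function names are made up for illustration.

```python
from typing import Optional


def thumbs_down_rate(ratings: list, min_samples: int = 20) -> Optional[float]:
    """Share of negative ratings (0 = thumbs down, 1 = thumbs up).

    Returns None when there are too few ratings for the rate to mean anything.
    """
    if len(ratings) < min_samples:
        return None
    return ratings.count(0) / len(ratings)


def review_queue(feedback: list) -> list:
    """Trace IDs that got a thumbs-down, in first-seen order, without duplicates."""
    seen = {}
    for item in feedback:
        if item["rating"] == 0:
            seen.setdefault(item["trace_id"], None)
    return list(seen)
```

The queue feeds human labeling rather than the eval set directly, because a thumbs-down says the answer was bad but not what the right answer would have been.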

Cost dashboards

SELECT
  feature,
  SUM(cost_usd) AS spend,
  COUNT(*) AS calls,
  AVG(cost_usd) AS cost_per_call
FROM llm_traces
WHERE ts > now() - interval '7 days'
GROUP BY feature
ORDER BY spend DESC;

The 80/20 of spend is usually obvious from this query. Optimize the top features. See LLM Cost Optimization.
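To make the 80/20 cut explicit rather than eyeballing it, the query's output can be reduced to the smallest set of features that covers a target share of spend. A minimal sketch, assuming the results have been loaded into a dict of feature name to spend; `top_spenders` and the feature names are hypothetical.

```python
def top_spenders(spend_by_feature: dict, share: float = 0.8) -> list:
    """Smallest prefix of features, sorted by spend, that covers `share` of the total."""
    total = sum(spend_by_feature.values())
    if total <= 0:
        return []
    picked = []
    running = 0.0
    ranked = sorted(spend_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    for feature, spend in ranked:
        picked.append(feature)
        running += spend
        if running >= share * total:
            break
    return picked


spend = {"chat": 70.0, "summarize": 20.0, "search": 10.0}
print(top_spenders(spend))
```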

Sampling

Capturing every prompt/response is expensive at scale. Sample:

100% errors.
100% refusals.
10% normal traces.
100% high-cost / high-latency outliers.

Adjust based on storage budget.
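The sampling rules above can be sketched as a single decision function. The outlier thresholds here are placeholders I picked for illustration, and `keep_trace` is a hypothetical name. Hashing the trace ID instead of calling a random number generator is a design choice: every service that sees the same trace makes the same keep-or-drop decision.

```python
import hashlib


def keep_trace(
    trace_id: str,
    status: str,
    cost_usd: float,
    latency_ms: float,
    normal_rate: float = 0.10,
    cost_limit_usd: float = 0.50,      # placeholder outlier threshold
    latency_limit_ms: float = 10_000,  # placeholder outlier threshold
) -> bool:
    """Decide whether to store the full prompt/response for a trace."""
    # Always keep errors and refusals.
    if status in ("error", "refused"):
        return True
    # Always keep high-cost or high-latency outliers.
    if cost_usd >= cost_limit_usd or latency_ms >= latency_limit_ms:
        return True
    # Map the trace ID to a stable bucket in [0, 1) and keep the lowest slice.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < normal_rate
```

Even for dropped traces it is usually worth keeping the cheap metadata (tokens, latency, cost, status), so dashboards and alerts still see 100% of traffic.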

PII redaction

Logs include user input. User input may include PII. Redact before storage:

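import re

# Deliberately simple patterns; tune them for your data before relying on them.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text

trace.update(input=redact(user_text))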

Use per-region storage if GDPR or HIPAA requires it.

Common mistakes

1. No tracing

“It just works.” Until it doesn’t, and you have nothing to debug.

2. Tracing without eval

Beautiful traces of garbage answers. Eval on traces.

3. No version on prompts

You change a prompt; all traces reference different prompts under the same name. Useless for comparison.

4. Logging full prompts to hosted observability tools

Sensitive data in third-party SaaS. Self-host or aggressively redact.

5. Ignoring user feedback

Thumbs-down rate climbing; nobody notices because no one checks. Pipe to alerting.

What I’d ship today

For a fresh LLM app:

Langfuse self-hosted for tracing.
OTEL for the rest of the stack; correlate via trace ID.
Eval set of 50–200 cases run on every deploy.
Daily eval on a sample of production traces.
User feedback UI piped into traces.
Cost dashboard by feature.
Alerts on error rate, latency, refusal rate.

Read this next

If you want my Langfuse + eval pipeline starter, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
