LLM apps fail differently from regular services. The HTTP call returns 200, but the response is wrong. Latency is fine, but token spend is over budget. There are no errors, yet users complain. You can’t debug what you can’t see. This post is a working playbook for visibility.

What to capture

For every LLM call:

  • Prompt (system + messages, with variable substitutions resolved).
  • Response (full text or tool call).
  • Token counts (input cached, input fresh, output).
  • Latency (TTFB, total).
  • Model + version.
  • Cost (computed from tokens × price).
  • User / session ID.
  • Trace ID (links to other spans).
  • Tool calls (each tool invocation, args, result).
  • Status (success / error / refused).

Save it all. Storage is cheap; debugging without it is impossible.
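
As a minimal sketch, the checklist above can live in one record per call, with cost derived from the token counts. The class name, field names, model key, and per-million-token prices here are illustrative assumptions, not real pricing or any tool's schema.

```python
from dataclasses import dataclass, field

# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICE_PER_MTOK = {
    "example-model": {"input_cached": 0.30, "input_fresh": 3.00, "output": 15.00},
}


@dataclass
class LLMCallRecord:
    """One row per LLM call, mirroring the capture checklist."""
    trace_id: str
    user_id: str
    session_id: str
    model: str
    prompt: list[dict]          # resolved system + messages
    response: str               # full text, or a serialized tool call
    input_cached_tokens: int
    input_fresh_tokens: int
    output_tokens: int
    ttfb_ms: float
    total_ms: float
    status: str                 # "success" | "error" | "refused"
    tool_calls: list[dict] = field(default_factory=list)

    @property
    def cost_usd(self) -> float:
        """Cost = tokens x price, summed over the three token buckets."""
        p = PRICE_PER_MTOK[self.model]
        return (
            self.input_cached_tokens * p["input_cached"]
            + self.input_fresh_tokens * p["input_fresh"]
            + self.output_tokens * p["output"]
        ) / 1_000_000
```

Keeping cached and fresh input tokens separate matters because they are typically billed at different rates.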

Tools

Tool               Strengths
Langfuse           OSS-friendly, self-hostable, prompt versioning
Arize Phoenix      OSS, evals + tracing, OpenInference
Helicone           Cheapest setup; proxy-based
LangSmith          LangChain-native; mature
Datadog LLM Obs    Existing DD shops; integrates with APM
Hand-rolled        OTEL spans + a Postgres table

For most teams: Langfuse (OSS) or Arize Phoenix (OSS). For enterprise: LangSmith or Datadog.

Langfuse trace

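from anthropic import AsyncAnthropic
from langfuse import Langfuse

lf = Langfuse()             # reads LANGFUSE_* keys from the environment
client = AsyncAnthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-6"

# Low-level trace/generation API (Langfuse Python SDK v2 style).
# handle_chat is an illustrative wrapper so the await has somewhere to live.
async def handle_chat(user, session_id, messages):
    trace = lf.trace(name="chat-handler", user_id=user.id, session_id=session_id)
    gen = trace.generation(name="answer", model=MODEL, input=messages)
    # max_tokens is an assumed value; the API requires one
    resp = await client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
    gen.end(
        output=resp.content[0].text,
        usage={"input": resp.usage.input_tokens, "output": resp.usage.output_tokens},
    )
    return resp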

Web UI: see every trace, drill into prompts/responses, replay, fork prompts.

OpenInference + OTEL

For OTEL shops:

from openinference.instrumentation.anthropic import AnthropicInstrumentor

AnthropicInstrumentor().instrument()
# Now every Anthropic call emits OTEL spans with LLM-specific attributes

Spans include llm.model_name, llm.token_count.prompt, llm.token_count.completion, etc. Send to Phoenix / Datadog / Tempo.

Eval pipelines

Tracing without evals gives you beautiful logs of bad answers. Combine the two.

async def eval_set(traces, judge):
    results = []
    for t in traces:
        score = await judge(t.input, t.output)
        results.append({"trace_id": t.id, "score": score.score, "reason": score.reason})
    return results

Run on a subset of production traces nightly. Score with an LLM judge or rule-based check.
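
A rule-based judge for the nightly run might look like the sketch below. It returns an object with the score and reason attributes that eval_set reads. The Score class, the refusal markers, and the nightly_sample helper are assumptions for illustration, not part of any library.

```python
import asyncio
import random
from dataclasses import dataclass


@dataclass
class Score:
    score: float
    reason: str


# Hypothetical refusal markers; tune these to your model's actual phrasing.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


async def rule_judge(input_text: str, output_text: str) -> Score:
    """Cheap rule-based judge: flags empty answers and likely refusals."""
    text = output_text.strip().lower()
    if not text:
        return Score(0.0, "empty output")
    if any(marker in text for marker in REFUSAL_MARKERS):
        return Score(0.0, "looks like a refusal")
    return Score(1.0, "passed rule checks")


def nightly_sample(traces: list, k: int, seed: int = 0) -> list:
    """Pick at most k traces to score; seeded so a run is reproducible."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))
```

Usage would be along the lines of await eval_set(nightly_sample(traces, 200), rule_judge). An LLM judge can expose the same two-argument signature, so the two are interchangeable.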

from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_substring: str

async def regress(cases: list[EvalCase], answer_fn):
    fails = []
    for c in cases:
        out = await answer_fn(c.input)
        if c.expected_substring not in out:
            fails.append(c)
    return fails

Run on every PR before deploy. Failures block the deploy. See LLM Evaluation.

Prompt versioning

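from langfuse.decorators import observe  # SDK v2 import path; newer SDKs use `from langfuse import observe`

PROMPT_V = "answer-v3"

@observe()  # observe is a module-level decorator, not a method on the client
def answer(question):
    p = lf.get_prompt(PROMPT_V)  # versioned prompt fetched from registry
    # `llm` stands in for your own model client wrapper
    return llm.complete(p.compile(question=question))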

When you change a prompt: bump version. Old traces still reference the old version. Compare quality across versions.
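
Comparing quality across versions can be as simple as grouping eval scores by the prompt version recorded on each trace. In this sketch the prompt_version and score keys are assumed field names, not a Langfuse schema.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(scored_traces: list[dict]) -> dict[str, float]:
    """Group eval scores by prompt version and average them."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for t in scored_traces:
        buckets[t["prompt_version"]].append(t["score"])
    return {version: mean(scores) for version, scores in buckets.items()}
```

A drop in the mean between adjacent versions is a signal to diff the two prompts before rolling forward.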

Regression detection

from datetime import date, timedelta

async def detect_regression():
    # `traces` and `alert` stand in for your trace store and alerting hook
    today_date = date.today()
    today = await traces.where(date=today_date)
    yesterday = await traces.where(date=today_date - timedelta(days=1))

    if today.error_rate > yesterday.error_rate * 1.5:
        alert("LLM error rate spike")
    if today.avg_latency > yesterday.avg_latency * 1.5:
        alert("LLM latency spike")
    if today.refusal_rate > 0.05:
        alert("LLM refusing too often")

Alert on quality regressions, not just outages.
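
The per-day numbers those checks compare have to come from somewhere. One way to compute them from raw trace rows is sketched below; DayStats, summarize, and the status and latency_ms keys are assumed names.

```python
from dataclasses import dataclass


@dataclass
class DayStats:
    error_rate: float
    avg_latency: float
    refusal_rate: float


def summarize(rows: list[dict]) -> DayStats:
    """Aggregate one day's trace rows into the rates the alert rules compare."""
    n = len(rows)
    if n == 0:
        return DayStats(0.0, 0.0, 0.0)
    return DayStats(
        error_rate=sum(r["status"] == "error" for r in rows) / n,
        avg_latency=sum(r["latency_ms"] for r in rows) / n,
        refusal_rate=sum(r["status"] == "refused" for r in rows) / n,
    )
```

One caveat with relative thresholds: if yesterday's error rate was zero, any error today trips the 1.5x rule, so an absolute floor may be worth adding.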

User feedback

@app.post("/feedback")
async def feedback(trace_id: str, rating: int, comment: str | None = None):
    # lf.score() is a regular (non-async) call, so it is not awaited
    lf.score(trace_id=trace_id, name="user_rating", value=rating, comment=comment)

Build feedback into the UI. Thumbs up / down per response. Feeds back into eval set.
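
To close the loop, the ratings can drive both a health metric and the list of traces worth turning into eval cases. This sketch assumes ratings are stored as 1 for thumbs up and 0 for thumbs down; both helper names are hypothetical.

```python
def thumbs_down_rate(ratings: list[int]) -> float:
    """Share of ratings that are thumbs down (assumes 1 = up, 0 = down)."""
    if not ratings:
        return 0.0
    return ratings.count(0) / len(ratings)


def eval_candidates(feedback: list[dict]) -> list[str]:
    """Trace IDs rated thumbs down, de-duplicated, in first-seen order."""
    seen: dict[str, None] = {}
    for item in feedback:
        if item["rating"] == 0:
            seen.setdefault(item["trace_id"], None)
    return list(seen)
```

The candidates still need a human to write the expected answer before they join the regression set.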

Cost dashboards

SELECT
  feature,
  SUM(cost_usd) AS spend,
  COUNT(*) AS calls,
  AVG(cost_usd) AS cost_per_call
FROM llm_traces
WHERE ts > now() - interval '7 days'
GROUP BY feature
ORDER BY spend DESC;

The 80/20 of spend is usually obvious from this query. Optimize the top features. See LLM Cost Optimization.
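
If the split is not obvious by eye, a few lines over the query result will pick out the features that account for most of the spend. The function name and the 0.8 default are illustrative.

```python
def top_spenders(spend_by_feature: dict[str, float], share: float = 0.8) -> list[str]:
    """Smallest set of top features whose combined spend reaches `share` of total."""
    total = sum(spend_by_feature.values())
    if total <= 0:
        return []
    picked: list[str] = []
    running = 0.0
    ranked = sorted(spend_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    for feature, spend in ranked:
        picked.append(feature)
        running += spend
        if running >= share * total:
            break
    return picked
```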

Sampling

Capturing every prompt/response is expensive at scale. Sample:

  • 100% errors.
  • 100% refusals.
  • 10% normal traces.
  • 100% high-cost / high-latency outliers.

Adjust based on storage budget.
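
The rules above can be sketched as a single capture decision. Hashing the trace ID, rather than calling a random number generator, keeps the 10% decision stable across services and retries. The outlier thresholds are assumed values to tune against your own traffic.

```python
import hashlib

NORMAL_SAMPLE_RATE = 0.10      # 10% of normal traces, per the list above
COST_OUTLIER_USD = 0.50        # hypothetical threshold
LATENCY_OUTLIER_MS = 10_000    # hypothetical threshold


def should_capture(trace_id: str, status: str, cost_usd: float, latency_ms: float) -> bool:
    """Decide whether to store the full prompt/response for this trace."""
    if status in ("error", "refused"):
        return True
    if cost_usd >= COST_OUTLIER_USD or latency_ms >= LATENCY_OUTLIER_MS:
        return True
    # Map the trace ID to a stable value in [0, 1) and keep the lowest 10%.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < NORMAL_SAMPLE_RATE
```

Sampling here applies to the heavy payload. Keeping token counts, latency, and cost for every call is cheap and may be worth it so the dashboards stay exact.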

PII redaction

Logs include user input. User input may include PII. Redact before storage:

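import re

# Deliberately simple patterns; real PII detection needs more than two regexes.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text

trace.update(input=redact(user_text))  # attach the redacted input to the trace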

Store traces per region if regulations such as GDPR or HIPAA require it.

Common mistakes

1. No tracing

“It just works.” Until it doesn’t, and you have nothing to debug.

2. Tracing without eval

Beautiful traces of garbage answers. Eval on traces.

3. No version on prompts

You change a prompt in place, and old and new traces now reference different prompt text under the same name. That makes them useless for comparison.

4. Logging full prompts to hosted observability tools

Sensitive data in third-party SaaS. Self-host or aggressively redact.

5. Ignoring user feedback

The thumbs-down rate climbs and nobody notices because no one checks. Pipe it to alerting.

What I’d ship today

For a fresh LLM app:

  1. Langfuse self-hosted for tracing.
  2. OTEL for the rest of the stack; correlate via trace ID.
  3. Eval set of 50–200 cases run on every deploy.
  4. Daily eval on a sample of production traces.
  5. User feedback UI piped into traces.
  6. Cost dashboard by feature.
  7. Alerts on error rate, latency, refusal rate.

Read this next

If you want my Langfuse + eval pipeline starter, it’s at rajpoot.dev.

