LLM apps fail differently from regular services. The HTTP call returns 200; the response is wrong. Latency is fine; token spend is not. No errors; users complain. You can’t debug what you can’t see. This post is a working playbook for visibility.
What to capture
For every LLM call:
- Prompt (system + messages, with variable substitutions resolved).
- Response (full text or tool call).
- Token counts (input cached, input fresh, output).
- Latency (TTFB, total).
- Model + version.
- Cost (computed from tokens × price).
- User / session ID.
- Trace ID (links to other spans).
- Tool calls (each tool invocation, args, result).
- Status (success / error / refused).
Save it all to start. Storage is cheap; debugging without it is impossible. At high volume, sample instead (see Sampling below).
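As a rough sketch, the fields above fit in one record per call. The class name, field names, and per-million-token prices here are illustrative assumptions, not any vendor's schema or real pricing.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES_PER_MTOK = {
    "example-model": {"input_cached": 0.30, "input_fresh": 3.00, "output": 15.00},
}


@dataclass
class LLMCallRecord:
    trace_id: str
    user_id: str
    session_id: str
    model: str
    prompt: list[dict[str, Any]]   # resolved system + messages
    response: str                  # full text, or a serialized tool call
    input_cached_tokens: int
    input_fresh_tokens: int
    output_tokens: int
    ttfb_ms: float
    total_ms: float
    status: str                    # "success" | "error" | "refused"
    tool_calls: list[dict[str, Any]] = field(default_factory=list)

    @property
    def cost_usd(self) -> float:
        """Tokens x price, summed over the three token buckets."""
        p = PRICES_PER_MTOK[self.model]
        return (
            self.input_cached_tokens * p["input_cached"]
            + self.input_fresh_tokens * p["input_fresh"]
            + self.output_tokens * p["output"]
        ) / 1_000_000
```

Computing cost at write time keeps dashboards simple, at the price of having to backfill if your price table was wrong.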
Tools
| Tool | Strengths |
|---|---|
| Langfuse | OSS-friendly, self-hostable, prompt versioning |
| Arize Phoenix | OSS, evals + tracing, OpenInference |
| Helicone | Cheapest setup; proxy-based |
| LangSmith | LangChain-native; mature |
| Datadog LLM Obs | Existing DD shops; integrates with APM |
| Hand-rolled | OTEL spans + a Postgres table |
For most teams: Langfuse (OSS) or Arize Phoenix (OSS). For enterprise: LangSmith or Datadog.
Langfuse trace
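from langfuse import Langfuse

lf = Langfuse()  # reads the LANGFUSE_* keys from the environment

# v2-style SDK calls (lf.trace / trace.generation).
# handle_chat is an illustrative wrapper so that `await` is legal.
async def handle_chat(client, user, session_id, messages):
    trace = lf.trace(name="chat-handler", user_id=user.id, session_id=session_id)
    gen = trace.generation(
        name="answer",
        model="claude-sonnet-4-6",
        input=messages,
    )
    resp = await client.messages.create(...)
    gen.end(
        output=resp.content[0].text,
        usage={"input": resp.usage.input_tokens, "output": resp.usage.output_tokens},
    )
    return resp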
Web UI: see every trace, drill into prompts/responses, replay, fork prompts.
OpenInference + OTEL
For OTEL shops:
from openinference.instrumentation.anthropic import AnthropicInstrumentor
AnthropicInstrumentor().instrument()
# Now every Anthropic call emits OTEL spans with LLM-specific attributes
Spans include llm.model_name, llm.token_count.prompt, llm.token_count.completion, etc. Send to Phoenix / Datadog / Tempo.
Eval pipelines
Tracing without evals = beautiful logs of bad answers. Combine.
async def eval_set(traces, judge):
    results = []
    for t in traces:
        score = await judge(t.input, t.output)
        results.append({"trace_id": t.id, "score": score.score, "reason": score.reason})
    return results
Run on a subset of production traces nightly. Score with an LLM judge or rule-based check.
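For the rule-based option, a judge can be a plain function with the same shape as the LLM judge: it takes input and output and returns something with `score` and `reason`. A minimal sketch; the `JudgeResult` type, the refusal markers, and the length threshold are all assumptions to tune for your app.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class JudgeResult:
    score: float
    reason: str


# Illustrative markers only; real refusals vary by model and prompt.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "as an ai")


async def rule_based_judge(input_text: str, output_text: str) -> JudgeResult:
    """Cheap checks with no model call; async only to match the judge signature."""
    text = output_text.strip()
    if not text:
        return JudgeResult(0.0, "empty output")
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return JudgeResult(0.0, "looks like a refusal")
    if len(text) > 4000:
        return JudgeResult(0.5, "suspiciously long answer")
    return JudgeResult(1.0, "passed rule checks")
```

Rules like these catch the obvious failures for free, so the LLM judge budget goes to the answers that actually need reading.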
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_substring: str

async def regress(cases: list[EvalCase], answer_fn):
    fails = []
    for c in cases:
        out = await answer_fn(c.input)
        if c.expected_substring not in out:
            fails.append(c)
    return fails
Run on every PR, before deploy. Any failure blocks the deploy. See LLM Evaluation.
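One way to make failures block the deploy is to turn the result into a process exit code that CI can act on. A sketch under assumptions: `gate` and `fake_answer` are hypothetical names, and `fake_answer` stands in for the real model call so the example runs offline.

```python
import asyncio
import sys
from dataclasses import dataclass


@dataclass
class EvalCase:
    input: str
    expected_substring: str


async def regress(cases, answer_fn):
    fails = []
    for c in cases:
        out = await answer_fn(c.input)
        if c.expected_substring not in out:
            fails.append(c)
    return fails


async def fake_answer(question: str) -> str:
    # Stand-in for the real model call.
    return "Paris is the capital of France."


def gate(cases, answer_fn) -> int:
    """Return 0 when every case passes, 1 otherwise."""
    fails = asyncio.run(regress(cases, answer_fn))
    for c in fails:
        print(f"FAIL: {c.input!r} is missing {c.expected_substring!r}", file=sys.stderr)
    return 1 if fails else 0

# In a CI script: sys.exit(gate(cases, real_answer_fn))
```

A non-zero exit fails the pipeline step, which is all most CI systems need to stop a deploy.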
Prompt versioning
from langfuse.decorators import observe  # v2 SDK import path

PROMPT_V = "answer-v3"

@observe()
def answer(question):
    p = lf.get_prompt(PROMPT_V)  # versioned prompt fetched from registry
    return llm.complete(p.compile(question=question))  # llm: your model client
When you change a prompt: bump version. Old traces still reference the old version. Compare quality across versions.
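Comparing quality across versions can be as simple as grouping eval scores by the prompt version recorded on each trace. A minimal sketch, assuming you can export `(prompt_version, score)` pairs from your trace store; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def score_by_version(scored_traces):
    """scored_traces: iterable of (prompt_version, score) pairs."""
    buckets = defaultdict(list)
    for version, score in scored_traces:
        buckets[version].append(score)
    # Report the sample size next to the mean so small buckets are not over-read.
    return {v: {"n": len(s), "mean": mean(s)} for v, s in buckets.items()}
```

For example, `score_by_version([("answer-v2", 1.0), ("answer-v2", 0.0), ("answer-v3", 1.0)])` gives a mean of 0.5 over two traces for v2 and 1.0 over one trace for v3, which is too little data to call v3 better.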
Regression detection
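from datetime import date, timedelta

# `traces` and `alert` are placeholders for your trace store and paging hook.
async def detect_regression():
    today = date.today()
    prev_stats = await traces.where(date=today - timedelta(days=1))
    curr_stats = await traces.where(date=today)
    if curr_stats.error_rate > prev_stats.error_rate * 1.5:
        alert("LLM error rate spike")
    if curr_stats.avg_latency > prev_stats.avg_latency * 1.5:
        alert("LLM latency spike")
    if curr_stats.refusal_rate > 0.05:
        alert("LLM refusing too often")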
Alert on quality regressions, not just outages.
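A bare ratio check gets noisy on low-traffic days or when the baseline rate is near zero. One possible guard, sketched below; the `min_calls` and `floor` defaults are assumptions, not recommendations.

```python
def rate_spiked(today_hits, today_total, base_hits, base_total,
                ratio=1.5, min_calls=200, floor=0.01):
    """True when today's rate exceeds ratio x baseline, with guards against noise."""
    if today_total < min_calls or base_total < min_calls:
        return False  # too few calls for the rate to mean much
    today_rate = today_hits / today_total
    # Clamp the baseline so a move from 0% to 0.1% does not page anyone.
    base_rate = max(base_hits / base_total, floor)
    return today_rate > base_rate * ratio
```

The same function works for error, refusal, or thumbs-down counts; only the inputs change.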
User feedback
@app.post("/feedback")
async def feedback(trace_id: str, rating: int, comment: str | None = None):
    # lf.score is a synchronous call in the v2 SDK, so it is not awaited.
    lf.score(trace_id=trace_id, name="user_rating", value=rating, comment=comment)
Build feedback into the UI. Thumbs up / down per response. Thumbs-down responses feed back into the eval set.
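A sketch of that feedback-to-eval loop: collect thumbs-down traces as candidate cases, skipping questions the eval set already covers. The `Feedback` shape and the 0/1 rating convention are assumptions; a human still has to write the expected answer for each candidate.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    trace_id: str
    rating: int      # assumed convention: 1 = thumbs up, 0 = thumbs down
    question: str
    answer: str


def candidate_cases(feedback, existing_questions):
    """Thumbs-down traces not yet in the eval set, deduplicated by question."""
    seen = {q.strip().lower() for q in existing_questions}
    out = []
    for fb in feedback:
        key = fb.question.strip().lower()
        if fb.rating == 0 and key not in seen:
            seen.add(key)
            out.append({
                "trace_id": fb.trace_id,
                "input": fb.question,
                "bad_output": fb.answer,
            })
    return out
```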
Cost dashboards
SELECT
    feature,
    SUM(cost_usd) AS spend,
    COUNT(*) AS calls,
    AVG(cost_usd) AS cost_per_call
FROM llm_traces
WHERE ts > now() - interval '7 days'
GROUP BY feature
ORDER BY spend DESC;
The 80/20 of spend is usually obvious from this query. Optimize the top features. See LLM Cost Optimization.
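If the split is not obvious by eye, the cut can be computed from the query result. A small sketch, assuming the rows have been loaded into a dict of feature to spend; the function name and feature names are made up.

```python
def top_spenders(spend_by_feature, share=0.8):
    """Smallest set of features, largest first, that covers `share` of total spend."""
    total = sum(spend_by_feature.values())
    picked, running = [], 0.0
    ranked = sorted(spend_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    for feature, spend in ranked:
        if total == 0 or running >= share * total:
            break
        picked.append(feature)
        running += spend
    return picked
```

With `{"chat": 70, "summarize": 20, "tagging": 10}` this returns `["chat", "summarize"]`: those two features are where optimization time pays off.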
Sampling
Capturing every prompt/response is expensive at scale. Sample:
- 100% errors.
- 100% refusals.
- 10% normal traces.
- 100% high-cost / high-latency outliers.
Adjust based on storage budget.
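The policy above can be sketched as a single keep-or-drop decision made once the call has finished, since cost and latency are only known at that point. The cutoffs below are placeholder assumptions. Hashing the trace ID rather than calling `random()` keeps the decision stable across services and retries.

```python
import hashlib


def should_store(trace_id, status, cost_usd, latency_ms,
                 base_rate=0.10, cost_cutoff=0.50, latency_cutoff_ms=10_000):
    """Keep all errors, refusals, and outliers; keep a stable fraction of the rest."""
    if status in ("error", "refused"):
        return True
    if cost_usd >= cost_cutoff or latency_ms >= latency_cutoff_ms:
        return True
    # Map the trace ID to a number in [0, 1) deterministically.
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return bucket < base_rate
```

Token counts, latency, and cost are small enough to keep for every call even when the prompt and response bodies are dropped.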
PII redaction
Logs include user input. User input may include PII. Redact before storage:
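import re

# Deliberately simple patterns; they will miss edge cases.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text

trace.update(input=redact(user_text))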
Store traces per region if your GDPR or HIPAA obligations require it.
Common mistakes
1. No tracing
“It just works.” Until it doesn’t, and you have nothing to debug.
2. Tracing without eval
Beautiful traces of garbage answers. Eval on traces.
3. No version on prompts
You change a prompt; old and new traces now reference different prompt text under the same name. Useless for comparison.
4. Logging full prompts to third-party observability tools
Sensitive data ends up in someone else’s SaaS. Self-host or aggressively redact.
5. Ignoring user feedback
Thumbs-down rate climbing; nobody notices because no one checks. Pipe to alerting.
What I’d ship today
For a fresh LLM app:
- Langfuse self-hosted for tracing.
- OTEL for the rest of the stack; correlate via trace ID.
- Eval set of 50–200 cases run on every deploy.
- Daily eval on a sample of production traces.
- User feedback UI piped into traces.
- Cost dashboard by feature.
- Alerts on error rate, latency, refusal rate.
Read this next
If you want my Langfuse + eval pipeline starter, it’s at rajpoot.dev.