Distributed tracing is the difference between guessing and knowing. When a request slows down, you want one view that shows where time was spent across all services. OpenTelemetry made this standard. This post is the working playbook.
What you get
Trace: GET /checkout
├─ http server: 1.2s
│  ├─ db: SELECT user 40ms
│  ├─ http client: GET /inventory 300ms
│  ├─ http client: POST /payment 600ms
│  │  └─ http server: POST /payment 580ms
│  │     └─ stripe.PaymentIntent.create 450ms
│  └─ db: INSERT order 80ms
One span per operation. Time spent visible. Causes obvious. No more “the API is slow somewhere.”
OpenTelemetry setup
# Python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.asyncpg import AsyncPGInstrumentor

app = FastAPI()  # assumed here so the snippet runs; use your own app instance

# Batch finished spans in memory and export them over OTLP/gRPC
# (the exporter defaults to localhost:4317).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Auto-instrument inbound HTTP (FastAPI), outbound HTTP (httpx), and Postgres (asyncpg).
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
AsyncPGInstrumentor().instrument()
Auto-instrumentation covers HTTP, DB, and queue clients out of the box, as long as you install the instrumentation package for each library you use. Add custom spans for business operations.
Custom spans
tracer = trace.get_tracer(__name__)

async def checkout(user_id):
    # Parent span for the whole business operation.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)

        # Child spans nest under "checkout" automatically.
        with tracer.start_as_current_span("validate_cart"):
            cart = await load_cart(user_id)

        with tracer.start_as_current_span("charge"):
            payment = await charge_user(user_id, cart.total)
            span.set_attribute("payment.id", payment.id)

        return payment
Each with block is a span. Attributes attach context. An exception that escapes a with block is recorded on that span and marks it as an error by default.
Context propagation
For tracing to span services, context must travel:
Service A: span starts → traceparent header set on outgoing HTTP
↓
Service B: traceparent header → continues the trace
OTEL handles this automatically for instrumented HTTP/gRPC clients. For custom transports (Kafka, queues): inject and extract manually.
# Inject (sender side)
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # writes traceparent (and tracestate, if any) into the dict
producer.send(topic, payload, headers=headers)

# Extract (receiver side)
from opentelemetry.propagate import extract

ctx = extract(headers)  # headers as received with the message
with tracer.start_as_current_span("process_message", context=ctx):
    handle(payload)
Without propagation, you get disconnected per-service traces — much less useful.
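To make the traceparent header less of a black box, here is a minimal sketch of what travels between services, assuming the W3C Trace Context format (version 00). The helper names (format_traceparent, parse_traceparent, child_headers) are made up for illustration; in a real service the OTEL propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
# This sketch only accepts version 00.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the trace
    parent_id: str  # 16 hex chars, the caller's span ID
    sampled: bool   # lowest bit of the flags byte


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header value a caller sends downstream."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_headers(incoming: str) -> dict:
    """Receiver side: keep the trace ID, mint a new span ID for the next hop."""
    parent = parse_traceparent(incoming)
    if parent is None:
        # No usable context: start a fresh trace (marked sampled for simplicity).
        value = format_traceparent(secrets.token_hex(16), secrets.token_hex(8), True)
    else:
        value = format_traceparent(parent.trace_id, secrets.token_hex(8), parent.sampled)
    return {"traceparent": value}
```

The trace ID is the only thing that stays constant across hops, which is why a dropped header splits one request into several unrelated traces.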
Sampling
100% tracing of every request is expensive. Sample.
Head sampling
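from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.1)  # 10% of traces; pass as TracerProvider(sampler=sampler)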
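The decision is made when the root span starts, before the outcome is known. That makes it cheap, but it can drop the traces you most want to see, such as errors and slow requests.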
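A ratio-based head sampler can decide from the trace ID alone, which is what lets every service reach the same keep-or-drop answer for the same trace without coordination. The sketch below illustrates that idea in plain Python; should_sample is a hypothetical helper, not the SDK's API, and it is not guaranteed to match any SDK's exact arithmetic.

```python
TRACE_ID_LOW_BITS = (1 << 64) - 1  # compare only the low 64 bits of the 128-bit ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic head-sampling decision derived from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Keep the trace if its low 64 bits fall in the first `ratio` slice of the range.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LOW_BITS) < bound
```

Because trace IDs are random, roughly 10% of them land below the bound at ratio 0.1, and a given trace is either kept everywhere or dropped everywhere.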
Tail sampling (via OTEL Collector)
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: random_10pct
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
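The Collector buffers each trace's spans for decision_wait, then decides once it has seen the whole trace. This keeps all error traces, all slow traces, and 10% of the rest: full coverage of the interesting traces, sampled storage for the routine ones.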
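The three policies boil down to a simple rule: a trace is kept if any policy matches. The following is a rough Python sketch of that decision, not the Collector's implementation; SpanSummary and keep_trace are hypothetical names, and the thresholds mirror the config above.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpanSummary:
    """Hypothetical minimal view of a finished span."""
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(
    spans: List[SpanSummary],
    slow_threshold_ms: float = 1000.0,
    baseline_ratio: float = 0.10,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide after the whole trace is buffered; any matching policy keeps it."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True  # policy "errors"
    # Trace duration: earliest start to latest end across all spans.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms >= slow_threshold_ms:
        return True  # policy "slow"
    return rng() < baseline_ratio  # policy "random_10pct"
```

The rng parameter exists only so the random policy can be pinned in tests. The trade-off versus head sampling is that every span still has to reach the Collector and sit in memory until the decision is made.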
Useful attributes
Standard OTEL attributes:
- http.method, http.status_code, http.url
- db.system, db.statement, db.name
- messaging.system, messaging.destination
App-specific:
- user.id
- tenant.id
- feature (e.g., “checkout”, “search”)
- experiment.variant
span.set_attribute("user.id", str(user_id))
span.set_attribute("feature", "checkout")
This lets you ask questions like: “show me checkout traces for tenant X with errors.”
Errors
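from opentelemetry.trace import Status, StatusCode

try:
    result = await operation()
except Exception as e:
    # Attach the exception as a span event and mark the span as failed.
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR, str(e)))
    raise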
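Auto-instrumentation does this for HTTP and DB spans. In business logic, do it explicitly wherever you catch and handle an exception, because a swallowed exception never reaches the span on its own.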
Storage
| Backend | Strengths |
|---|---|
| Tempo (Grafana) | OSS; cheap object-storage backend |
| Jaeger | OSS; mature; standalone |
| Datadog APM | SaaS; full APM features |
| Honeycomb | SaaS; high-cardinality friendly |
| New Relic | SaaS; APM |
For self-host: Tempo + Grafana. For SaaS: Honeycomb is excellent for ad-hoc trace querying.
Cost
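Traces are heavy: a typical request emits 5-20 spans, each 1-5 KB. At 10k req/sec that is roughly 4 to 86 TB/day of raw span data if you keep everything (10,000 req/sec × 86,400 sec × 5-20 spans × 1-5 KB), and even a 10% sample is hundreds of GB per day. Sampling matters.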
Aim for:
- 100% errors and slow traces.
- 10% normal traces.
- Adjust sampling based on volume.
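To size this for your own traffic, a back-of-envelope calculation is enough. The sketch below plugs in the figures from this section (10k req/sec, 5-20 spans per request, 1-5 KB per span); span_bytes_per_day is a hypothetical helper, and the result is raw payload size before compression or backend overhead.

```python
SECONDS_PER_DAY = 86_400


def span_bytes_per_day(req_per_sec, spans_per_req, bytes_per_span, keep_ratio=1.0):
    """Raw span volume per day, before compression, for a given sampling ratio."""
    return req_per_sec * SECONDS_PER_DAY * spans_per_req * bytes_per_span * keep_ratio


if __name__ == "__main__":
    for keep in (1.0, 0.1, 0.01):
        low = span_bytes_per_day(10_000, 5, 1_000, keep)
        high = span_bytes_per_day(10_000, 20, 5_000, keep)
        print(f"keep {keep:>4.0%}: {low / 1e12:.2f} to {high / 1e12:.2f} TB/day")
```

Unsampled, the range works out to about 4.3 to 86.4 TB/day, which is why the sampling ratio is the main cost lever.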
Trace + log correlation
import logging
from opentelemetry import trace

class TraceLogFilter(logging.Filter):
    def filter(self, record):
        # Copy the active span's IDs onto the log record (IDs are 0 when no span is active).
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.span_id else "-"
        return True
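Attach the filter to your log handler and include %(trace_id)s and %(span_id)s in the format string. Every log line then carries its trace ID, so you can click from a log to its trace and back. Critical for debugging.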
What to trace
- HTTP requests (in/out).
- DB queries.
- Cache lookups (with hit/miss attribute).
- External API calls.
- Queue produce / consume.
- Major business operations (checkout, signup, etc.).
Don’t trace:
- Tight loops (each iteration becoming a span).
- Trivial in-process calls.
- Things you’d never query.
Common mistakes
1. Tracing without context propagation
Each service has its own disconnected trace. Forgot to wire up the headers.
2. Too many spans
A trace with 10000 spans is unreadable. Keep major operations only.
3. No sampling
Tracing everything → trace storage bill bigger than infra bill.
4. PII in attributes
Names, emails, full SQL with values. Redact or hash.
5. No alerting on trace data
Beautiful traces; nobody looks. Alert on p99 latency, error rate per service.
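For mistake 4, one option is to pseudonymize sensitive values before they reach a span, so traces stay queryable per user without storing the raw value. This is a minimal sketch using a keyed hash; the attribute keys in SENSITIVE_KEYS and the helper names are assumptions for illustration, and the secret should come from your own secret store.

```python
import hashlib
import hmac

# Hypothetical attribute keys treated as sensitive in this sketch.
SENSITIVE_KEYS = {"user.email", "user.name"}


def pseudonymize(value: str, secret: bytes) -> str:
    """Keyed hash: stable for the same input, not reversible without the secret."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in trace UIs


def scrub_attributes(attrs: dict, secret: bytes) -> dict:
    """Return a copy of attrs with sensitive values replaced by their hash."""
    return {
        key: pseudonymize(str(value), secret) if key in SENSITIVE_KEYS else value
        for key, value in attrs.items()
    }
```

You would then pass the scrubbed dict to the span, for example span.set_attributes(scrub_attributes(attrs, secret)). The same email always maps to the same token, so "all traces for this user" still works as a query.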
What I’d ship today
For a new service:
- OTEL SDK + auto-instrumentation.
- Custom spans for major business operations.
- OTLP export to OTEL Collector.
- Tail sampling in collector (errors + slow + 10% random).
- Tempo or Honeycomb as backend.
- Trace IDs in logs.
- Grafana dashboards linking metrics → traces → logs.
If you want my OTEL setup for FastAPI / Go / Node, it’s at rajpoot.dev.