Tracing or logging — which first?

Logging gives you 'what happened in this service.' Tracing gives you 'what happened across services.' Most teams need both. Start with structured logging; add tracing when service-to-service issues become hard to debug.

Head sampling or tail sampling?

Head sampling (decide at trace start) is simpler but blind to outcomes. Tail sampling (decide at trace end) keeps interesting traces (errors, slow) but needs a smart collector. For mature setups: tail.

Distributed Tracing in 2026 — OpenTelemetry, Trace Context, and What Actually Helps Debugging

Distributed tracing is the difference between guessing and knowing. When a request slows down, you want one view that shows where time was spent across all services. OpenTelemetry made this standard. This post is the working playbook.

What you get

Trace: GET /checkout
├─ http server: 1.2s
│  ├─ db: SELECT user      40ms
│  ├─ http client: GET /inventory  300ms
│  ├─ http client: POST /payment   600ms
│  │  └─ http server: POST /payment 580ms
│  │     └─ stripe.PaymentIntent.create  450ms
│  └─ db: INSERT order      80ms

One span per operation. Time spent visible. Causes obvious. No more “the API is slow somewhere.”
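To read a tree like this, it helps to separate each span's own time from the time spent in its children. The sketch below does that for the example trace. It assumes children run one after another, and the nested `(name, duration_ms, children)` tuples and the `self_times` helper are hypothetical, not an OpenTelemetry data structure.

```python
# Hypothetical in-memory copy of the trace above: (name, duration_ms, children).
TRACE = (
    "http server: GET /checkout", 1200, [
        ("db: SELECT user", 40, []),
        ("http client: GET /inventory", 300, []),
        ("http client: POST /payment", 600, [
            ("http server: POST /payment", 580, [
                ("stripe.PaymentIntent.create", 450, []),
            ]),
        ]),
        ("db: INSERT order", 80, []),
    ],
)


def self_times(span):
    """Return {span name: time not covered by direct children}, in ms.

    Assumes children do not overlap in time; parallel children would need
    interval arithmetic on start/end timestamps instead.
    """
    name, duration_ms, children = span
    result = {name: duration_ms - sum(child[1] for child in children)}
    for child in children:
        result.update(self_times(child))
    return result


if __name__ == "__main__":
    # Print spans ordered by their own time, largest first.
    for name, ms in sorted(self_times(TRACE).items(), key=lambda kv: -kv[1]):
        print(f"{ms:5d} ms  {name}")
```

In this trace the Stripe call accounts for 450 ms of the 1.2 s total, which is where you would look first.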

OpenTelemetry setup

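# Python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.asyncpg import AsyncPGInstrumentor

app = FastAPI()

# service.name labels this service's spans; "checkout-api" is a placeholder.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
# OTLPSpanExporter() sends to a local collector (gRPC, port 4317)
# unless OTEL_EXPORTER_OTLP_ENDPOINT is set.
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

FastAPIInstrumentor.instrument_app(app)   # inbound HTTP
HTTPXClientInstrumentor().instrument()    # outbound HTTP
AsyncPGInstrumentor().instrument()        # Postgres queries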

Auto-instrumentation packages cover common HTTP, DB, and queue libraries out of the box. Add custom spans for business operations.

Custom spans

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def checkout(user_id):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", str(user_id))

        with tracer.start_as_current_span("validate_cart"):
            cart = await load_cart(user_id)

        with tracer.start_as_current_span("charge"):
            payment = await charge_user(user_id, cart.total)
            # `span` is still the outer "checkout" span, so the payment ID lands there.
            span.set_attribute("payment.id", payment.id)

        return payment

Each with block is a span. Attributes attach context. Exceptions that escape a with block are recorded on that span automatically.

Context propagation

For tracing to span services, context must travel:

Service A: span starts → traceparent header set on outgoing HTTP
                                 ↓
Service B: traceparent header → continues the trace

OTEL handles this automatically for instrumented HTTP/gRPC clients. For custom transports (Kafka, queues): inject and extract manually.

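# Inject (sender side)
from opentelemetry.propagate import inject
headers = {}
inject(headers)  # writes the traceparent entry into the dict
# Convert to whatever header format your client expects, e.g. a list of (key, bytes) tuples.
producer.send(topic, payload, headers=headers)

# Extract (receiver side)
from opentelemetry.propagate import extract
ctx = extract(headers)  # headers must be a dict of str -> str here
with tracer.start_as_current_span("process_message", context=ctx):
    handle(payload)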

Without propagation, you get disconnected per-service traces — much less useful.
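For reference, the traceparent value that travels between services is a small fixed-format string defined by W3C Trace Context: version, 32-hex trace ID, 16-hex parent span ID, and flags. The sketch below builds and parses one in plain Python to make the format concrete. `make_traceparent` and `parse_traceparent` are hypothetical helpers, this is not a full validator, and in real code the OTEL propagators do this for you.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format a version-00 traceparent value."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_hex, span_hex, flags_hex = match.groups()
    trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == 0 or span_id == 0:
        return None
    return trace_id, span_id, bool(int(flags_hex, 16) & 0x01)


if __name__ == "__main__":
    header = make_traceparent(secrets.randbits(128) or 1, secrets.randbits(64) or 1)
    print(header)
    print(parse_traceparent(header))
```

The receiving service keeps the trace ID and uses the parent span ID as the parent of its own first span, which is what stitches the two halves into one trace.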

Sampling

Tracing 100% of requests is expensive. Sample.

Head sampling

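from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.1)  # 10% of traces; pass as TracerProvider(sampler=sampler)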

Decide at the start. Cheap; can miss interesting events (errors).
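Ratio-based head samplers typically derive the decision from the trace ID itself, so every service that sees the same trace ID reaches the same verdict without coordination. The sketch below illustrates that idea in plain Python. It is not the SDK's code, and `should_sample` is a hypothetical helper.

```python
import random

TRACE_ID_LIMIT = 1 << 64  # compare against the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic head-sampling decision derived from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the demo is repeatable
    ids = [rng.getrandbits(128) for _ in range(100_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept / len(ids):.1%} of traces")
```

Because the decision depends only on the ID and the ratio, it is made before anything is known about the request's outcome, which is exactly why head sampling can drop the one trace that errored.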

Tail sampling (via OTEL Collector)

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: random_10pct
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

The collector buffers each trace's spans for decision_wait, then decides. A trace is kept if any policy matches: all error traces, all slow traces, 10% of the rest. Best of both.
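The decision the three policies encode can be sketched in a few lines of Python. This is only an illustration of the logic, not the collector's implementation: the `Span` record and `keep_trace` function are hypothetical, and trace latency is approximated by the longest span.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical minimal span record; real spans carry far more fields."""
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, threshold_ms=1000, sample_pct=10, rng=random):
    """Return the name of the first matching policy, or None to drop the trace."""
    if any(s.is_error for s in spans):
        return "errors"
    # Approximate trace latency by the longest span (usually the root).
    if max(s.duration_ms for s in spans) >= threshold_ms:
        return "slow"
    if rng.random() * 100 < sample_pct:
        return "random_10pct"
    return None


if __name__ == "__main__":
    failed = [Span("t1", 120), Span("t1", 40, is_error=True)]
    slow = [Span("t2", 1500), Span("t2", 900)]
    print(keep_trace(failed))  # matched by the error policy
    print(keep_trace(slow))    # matched by the latency policy
```

The key difference from head sampling is the input: the function sees every span of the finished trace, so it can keep a trace because of something that happened at the very end.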

Useful attributes

Standard OTEL attributes:

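http.request.method, http.response.status_code, url.full (older releases: http.method, http.status_code, http.url)
db.system.name, db.query.text, db.namespace (older releases: db.system, db.statement, db.name)
messaging.system, messaging.destination.name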

App-specific:

user.id
tenant.id
feature (e.g., “checkout”, “search”)
experiment.variant

span.set_attribute("user.id", str(user_id))
span.set_attribute("feature", "checkout")

This lets you query: “show me checkout traces for tenant X with errors.”

Errors

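from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

span = trace.get_current_span()
try:
    result = await operation()
except Exception as e:
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR, str(e)))
    raise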

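Auto-instrumentation does this for HTTP / DB spans, and start_as_current_span records exceptions that escape its with block. Do it explicitly when you catch an exception yourself or want it recorded on a different span.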

Storage

Backend	Strengths
Tempo (Grafana)	OSS; cheap object-storage backend
Jaeger	OSS; mature; standalone
Datadog APM	SaaS; full APM features
Honeycomb	SaaS; high-cardinality friendly
New Relic	SaaS; APM

For self-host: Tempo + Grafana. For SaaS: Honeycomb is excellent for ad-hoc trace querying.

Cost

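Traces are heavy: a typical request emits 5-20 spans, each 1-5 KB. At 10k req/sec that is roughly 4 to 86 TB/day uncompressed if you keep everything, and still about 0.4 to 9 TB/day at 10% sampling. Sampling matters.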

Aim for:

100% errors and slow traces.
10% normal traces.
Adjust sampling based on volume.
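A quick back-of-the-envelope check of span volume, using the per-request figures above. It assumes 1 KB = 1000 bytes and no compression, and `daily_trace_bytes` is a hypothetical helper for the arithmetic only.

```python
SECONDS_PER_DAY = 86_400
TB = 1000 ** 4


def daily_trace_bytes(req_per_sec, spans_per_req, bytes_per_span, keep_ratio=1.0):
    """Uncompressed span volume per day, in bytes."""
    return req_per_sec * SECONDS_PER_DAY * spans_per_req * bytes_per_span * keep_ratio


if __name__ == "__main__":
    low = daily_trace_bytes(10_000, 5, 1_000)    # 5 spans of 1 KB each
    high = daily_trace_bytes(10_000, 20, 5_000)  # 20 spans of 5 KB each
    print(f"unsampled: {low / TB:.1f} to {high / TB:.1f} TB/day")
    print(f"10% kept:  {low * 0.1 / TB:.2f} to {high * 0.1 / TB:.2f} TB/day")
```

Plug in your own request rate and span sizes; the result scales linearly with each input, so halving span count or sampling rate halves the bill.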

Trace + log correlation

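import logging
from opentelemetry import trace

class TraceLogFilter(logging.Filter):
    """Attach the current trace and span IDs to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # IDs are 0 when no span is active; log "-" in that case.
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.span_id else "-"
        return True

# Attach the filter to a handler and include the IDs in the format string.
handler = logging.StreamHandler()
handler.addFilter(TraceLogFilter())
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
logging.getLogger().addHandler(handler)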

Log lines include trace_id; click from a log to its trace and back. Critical for debugging.

What to trace

HTTP requests (in/out).
DB queries.
Cache lookups (with hit/miss attribute).
External API calls.
Queue produce / consume.
Major business operations (checkout, signup, etc.).

Don’t trace:

Tight loops (each iteration becoming a span).
Trivial in-process calls.
Things you’d never query.

Common mistakes

1. Tracing without context propagation

Each service has its own disconnected trace. Forgot to wire up the headers.

2. Too many spans

A trace with 10,000 spans is unreadable. Keep major operations only.

3. No sampling

Tracing everything → trace storage bill bigger than infra bill.

4. PII in attributes

Names, emails, full SQL with values. Redact or hash.
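One way to hash rather than drop a value is a keyed hash, so the same user still maps to the same opaque token and traces stay queryable. A minimal sketch, assuming a secret loaded from configuration. `ATTRIBUTE_HASH_KEY`, `pseudonymize`, `safe_attributes`, and the attribute names treated as sensitive are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from configuration, not source code.
ATTRIBUTE_HASH_KEY = b"replace-me"


def pseudonymize(value: str, key: bytes = ATTRIBUTE_HASH_KEY) -> str:
    """Keyed hash so the same input always maps to the same opaque token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in trace UIs


def safe_attributes(attrs: dict, sensitive=("user.email", "user.name")) -> dict:
    """Return a copy of attrs with sensitive values replaced by tokens."""
    return {
        key: pseudonymize(str(value)) if key in sensitive else value
        for key, value in attrs.items()
    }


if __name__ == "__main__":
    raw = {"user.email": "a@example.com", "feature": "checkout"}
    print(safe_attributes(raw))
```

Apply something like this before calling span.set_attribute, so the raw value never leaves the process.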

5. No alerting on trace data

Beautiful traces; nobody looks. Alert on p99 latency, error rate per service.
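The two numbers worth alerting on can be computed from span records as sketched below. In practice they usually come from your metrics or tracing backend, so treat this as an illustration of the calculation only. The `(service, duration_ms, is_error)` tuple shape, the function names, and the thresholds are all hypothetical.

```python
from collections import defaultdict


def service_stats(spans):
    """spans: iterable of (service, duration_ms, is_error) tuples."""
    by_service = defaultdict(list)
    for service, duration_ms, is_error in spans:
        by_service[service].append((duration_ms, is_error))
    stats = {}
    for service, rows in by_service.items():
        durations = sorted(d for d, _ in rows)
        # Nearest-rank p99: integer ceiling of 0.99 * n, 1-based.
        rank = (99 * len(durations) + 99) // 100
        stats[service] = {
            "p99_ms": durations[rank - 1],
            "error_rate": sum(e for _, e in rows) / len(rows),
        }
    return stats


def breaches(stats, p99_limit_ms=1000, error_rate_limit=0.01):
    """Return the services exceeding either (illustrative) threshold."""
    return sorted(
        name for name, s in stats.items()
        if s["p99_ms"] > p99_limit_ms or s["error_rate"] > error_rate_limit
    )


if __name__ == "__main__":
    sample = [("checkout", d, False) for d in range(1, 101)] + [("search", 2000, False)]
    print(breaches(service_stats(sample)))
```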

What I’d ship today

For a new service:

OTEL SDK + auto-instrumentation.
Custom spans for major business operations.
OTLP export to OTEL Collector.
Tail sampling in collector (errors + slow + 10% random).
Tempo or Honeycomb as backend.
Trace IDs in logs.
Grafana dashboards linking metrics → traces → logs.

Read this next

If you want my OTEL setup for FastAPI / Go / Node, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
