Observability for Backend Developers: Logs, Metrics, Traces

The first time a backend goes down in production, you realize how much you don’t know about your own system. What changed? Which user is affected? Is the database the bottleneck? Which version of the code is running? “Logs and metrics” stops being an abstract good idea and becomes the only way to find out.

This post is the practical observability guide for backend developers. We’ll cover the three pillars (logs, metrics, traces), when each one is the right tool, and the tooling that makes them work together. By the end you’ll know what to instrument and how — without drowning in vendor pages.

The three pillars (and what they’re each good at)

Logs

Discrete events with context. The detailed story.

Best for: “what happened on this specific request?”
Cost model: roughly proportional to volume. Each line is stored.
When to reach for them: debugging a specific incident.

Metrics

Numbers, aggregated over time. The dashboard story.

Best for: “is the system healthy right now? Has it ever been?”
Cost model: proportional to cardinality (number of unique label combinations), not raw data volume.
When to reach for them: dashboards, alerts, capacity planning.

Traces

End-to-end view of a single request as it crosses services.

Best for: “where in the call graph did time go?”
Cost model: sampling-friendly; usually cheap if you sample.
When to reach for them: distributed systems debugging, latency investigations.

You need all three. They’re not redundant; they answer different questions.

Logs: structured or it didn’t happen

The difference between logs that help and logs that don’t is structure. Plain text:

2026-04-28 10:31:22 ERROR Could not charge user 42 amount 100

Structured (JSON):

{"ts":"2026-04-28T10:31:22Z","level":"error","msg":"charge failed","user_id":42,"amount":100,"reason":"insufficient_funds","request_id":"abc-123"}

Why it matters: you can grep both. But you can only query the second one.

# LogQL (Loki); Elasticsearch and CloudWatch Logs Insights have their own equivalents
{level="error"} | json | user_id="42"

In Python:

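import structlog

# The default renderer is human-readable; ask for JSON explicitly.
structlog.configure(processors=[
    structlog.contextvars.merge_contextvars,  # picks up bound request IDs
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])
log = structlog.get_logger()
log.info("charge_attempted", user_id=42, amount=100)
log.error("charge_failed", user_id=42, amount=100, reason="insufficient_funds")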

In Go:

import (
	"log/slog"
	"os"
)

// JSON handler writing one object per line to stdout.
logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
logger.Info("charge_attempted", "user_id", 42, "amount", 100)
logger.Error("charge_failed", "user_id", 42, "amount", 100, "reason", "insufficient_funds")

Use a structured logger from day one. Plain print()/fmt.Println belongs in scripts, not services.
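If adding a dependency is not an option, the same idea can be approximated with the standard library alone. The following is a minimal sketch, not a replacement for structlog: the `JsonFormatter` class, the `make_logger` helper, and the `fields` key passed through `extra` are all names invented for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def make_logger(name: str = "billing") -> logging.Logger:
    """Build a logger that emits one JSON line per event on stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger()
log.error(
    "charge_failed",
    extra={"fields": {"user_id": 42, "amount": 100, "reason": "insufficient_funds"}},
)
```

The point is only that every field ends up as a key a log backend can filter on, rather than as text inside a message string.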

Add context

Every log entry should answer “for which request?”. Inject a request ID early in your middleware and attach it to every log line:

# FastAPI example
import uuid

import structlog
from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def add_request_id(request: Request, call_next):
    # Start clean so nothing leaks in from a previous request,
    # even if that request ended with an exception.
    structlog.contextvars.clear_contextvars()
    request_id = request.headers.get("x-request-id") or str(uuid.uuid4())
    structlog.contextvars.bind_contextvars(request_id=request_id)
    response = await call_next(request)
    response.headers["x-request-id"] = request_id
    return response

Now every log.info(...) inside that request automatically includes the request ID. Trace through the logs of a failed request in seconds.
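Under the hood this relies on Python's `contextvars`, which give each request (each asyncio task) its own view of a variable. Here is a rough standard-library sketch of the mechanism; `bind_request_id` and `log_event` are hypothetical helpers for illustration, not structlog functions.

```python
import contextvars
import uuid
from typing import Optional

# Holds the current request's ID; "-" means no request is bound.
_request_id = contextvars.ContextVar("request_id", default="-")


def bind_request_id(incoming: Optional[str] = None) -> str:
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    request_id = incoming or str(uuid.uuid4())
    _request_id.set(request_id)
    return request_id


def log_event(event: str, **fields) -> dict:
    """Build a log entry that always carries the bound request ID."""
    return {"event": event, "request_id": _request_id.get(), **fields}


bind_request_id("abc-123")
print(log_event("charge_failed", user_id=42))
```

Code deep in the call stack never has to accept a `request_id` parameter; it just logs, and the ID comes along.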

What to log

Errors and warnings with full context (user/tenant ID, parameters, error type).
Slow requests (>1s) — with timing breakdown.
Auth events — logins, failures, role changes. (At a level you can audit later.)
Data mutations at boundaries — payments, deletes, exports.

What NOT to log:

Secrets. Tokens, passwords, API keys, JWT contents: never log them. Check for this in code review, and periodically audit what actually reaches your log store.
Every request. Your access log already does this; structured logs are for interesting events.
PII without thought — at minimum, scrub it for cross-team views.
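One way to make the "no secrets" rule harder to break is to scrub events before they are rendered. The sketch below is an assumption-laden example: the key list is illustrative and certainly incomplete, and `redact` is a made-up helper. The `redact_processor` wrapper follows the `(logger, method_name, event_dict)` shape that structlog processors use, so it could be placed in a processor chain.

```python
# Illustrative list only; extend it for your own field names.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "jwt"}


def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


def redact_processor(logger, method_name, event_dict):
    """Adapter with the structlog processor signature."""
    return redact(event_dict)


print(redact({"user_id": 42, "token": "s3cr3t", "headers": {"Authorization": "Bearer x"}}))
```

Key-based scrubbing will not catch a secret that someone interpolates into a message string, so it complements review rather than replacing it.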

Where logs go

Tiny scale: stdout/stderr → systemd journal → grep.
Small/medium scale: ship to a hosted service (Better Stack, formerly Logtail, or Papertrail).
Production: centralized aggregation — Loki (lightweight, integrates with Prometheus/Grafana), Elasticsearch/OpenSearch (powerful but heavy), CloudWatch / Cloud Logging (managed, cloud-locked).

Don’t try to grep tens of GBs of logs over SSH at 3 AM. You will lose.

Metrics: dashboards and alerts live here

Metrics are time series — for each name, a series of (timestamp, value) pairs, optionally with labels.

http_requests_total{method="GET", path="/users", status="200"} = 12483
http_request_duration_seconds_bucket{path="/users", le="0.1"} = 11200
db_connections_in_use{pool="primary"} = 7

In 2026, the de facto standard is Prometheus (or a Prometheus-compatible system: VictoriaMetrics, Mimir, Cortex, Amazon Managed Service for Prometheus).

The four metric types

Counter: only goes up (and resets to zero on restart). http_requests_total. Use rate() to get the per-second rate.
Gauge: can go up or down. queue_depth, db_connections_in_use, memory_bytes.
Histogram: observations counted into buckets. http_request_duration_seconds. Lets you estimate percentiles (p50, p95, p99) at query time with histogram_quantile(), and aggregate them across instances.
Summary: percentiles pre-computed at the source. They cannot be aggregated across instances, so they are less flexible than histograms; usually prefer histograms.
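It helps to see why a histogram gives an estimate rather than an exact percentile. The sketch below imitates the general approach of interpolating linearly inside the bucket that contains the target rank. It is a simplified illustration, not Prometheus's actual implementation, and the bucket counts are invented.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound,
    with the last upper bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate into an open-ended or empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented data: 80 requests took <= 0.1s, 95 took <= 0.5s, and so on.
BUCKETS = [(0.1, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(estimate_quantile(0.5, BUCKETS))
```

The accuracy depends entirely on where the bucket boundaries sit, which is why it pays to choose boundaries close to the latencies you alert on.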

What to instrument: RED and USE

Two acronyms that cover what you actually need:

RED for services: Rate, Errors, Duration. For every endpoint.
USE for resources: Utilization, Saturation, Errors. For CPU, memory, disk, network, DB connections.

If you have RED on every service and USE on every resource, you can debug almost any production issue.
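To make RED concrete, here is a small sketch that computes the three numbers for one endpoint from a window of raw request samples. In practice your metrics backend does this from counters and histograms; the `RequestSample` type and `red_summary` function are hypothetical and exist only to show what each letter measures.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int        # HTTP status code
    duration_s: float  # wall-clock latency in seconds


def red_summary(samples: list, window_s: float) -> dict:
    """Rate, Errors, Duration for one endpoint over a window of `window_s` seconds."""
    total = len(samples)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_s": 0.0}
    errors = sum(1 for s in samples if s.status >= 500)
    durations = sorted(s.duration_s for s in samples)
    p95_index = math.ceil(0.95 * total) - 1  # nearest-rank percentile
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total,
        "p95_s": durations[p95_index],
    }


samples = [
    RequestSample(200, 0.1),
    RequestSample(200, 0.2),
    RequestSample(500, 0.9),
    RequestSample(200, 0.3),
]
print(red_summary(samples, window_s=2.0))
```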

Avoid the cardinality trap

Every unique label-value combination is a separate time series.

# BAD — user_id can be millions of values
http_requests_total{path="/users", user_id="42"}

# GOOD — bucketed status
http_requests_total{path="/users", status="200"}

A few hundred unique values per label is fine. A few million will OOM your metrics store. Never use unbounded values (user IDs, request IDs, error messages) as labels.
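The arithmetic behind the trap is just multiplication: the worst-case series count for a metric is the product of the number of values each label can take. The sketch below shows that calculation plus one possible way to keep a path label bounded. The function names and the regex are assumptions for this example; adjust the pattern to your own URL scheme.

```python
import math
import re


def series_count(label_cardinalities: dict) -> int:
    """Worst-case number of time series for one metric."""
    return math.prod(label_cardinalities.values())


# Matches purely numeric segments and UUID-shaped segments.
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F-]{36})(?=/|$)")


def normalize_path(path: str) -> str:
    """Collapse ID-like path segments so the label value set stays small."""
    return _ID_SEGMENT.sub("/{id}", path)


print(series_count({"method": 5, "path": 40, "status": 8}))
print(series_count({"method": 5, "path": 40, "status": 8, "user_id": 1_000_000}))
print(normalize_path("/users/42/orders/1001"))
```

Adding a single million-value label multiplies every existing series by a million, which is why one careless label can take down a metrics store.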

Instrumenting a Python app

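import time

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()

REQUEST_COUNT = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency",
    ["method", "path"],
)


@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    # Label by route template ("/users/{id}"), not the raw URL, so the
    # path label stays bounded. FastAPI stores the matched route in the
    # request scope once routing has run; unmatched requests share one label.
    route = request.scope.get("route")
    path = route.path if route else "unmatched"
    REQUEST_LATENCY.labels(request.method, path).observe(time.perf_counter() - start)
    REQUEST_COUNT.labels(request.method, path, response.status_code).inc()
    return response


@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)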

Prometheus scrapes /metrics on a fixed interval (15 to 60 seconds is typical; the default is one minute). Grafana queries Prometheus to render dashboards.

Alerts

Metrics only matter if someone gets paged when they go bad. A few alerts every backend should have:

Error rate > X% for Y minutes: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01 (the sum() matters: without it the division is matched series by series, so each 5xx series is divided by itself and the ratio is always 1)
p99 latency > X seconds for Y minutes: histogram_quantile(0.99, ...) > 1.0
Database connections saturated: db_connections_in_use / db_pool_size > 0.9
Disk space < 10%
Process restarts: your service is crash-looping
Queue depth growing unbounded (for Celery/RQ/etc.)

Tune until you trust them. False pages are the fastest way to make a team ignore real ones.
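The "for Y minutes" part is what keeps a single bad scrape from paging anyone. Prometheus alerting rules express it with a `for:` clause; the sketch below shows the underlying idea in plain Python. The `ForDurationAlert` class is a hypothetical illustration, not part of any alerting library.

```python
class ForDurationAlert:
    """Fire only after the condition has held for `for_evals` consecutive evaluations."""

    def __init__(self, threshold: float, for_evals: int):
        self.threshold = threshold
        self.for_evals = for_evals
        self._breaches = 0

    def evaluate(self, value: float) -> bool:
        # Any healthy evaluation resets the streak.
        self._breaches = self._breaches + 1 if value > self.threshold else 0
        return self._breaches >= self.for_evals


# Error ratio sampled once per evaluation interval; values are invented.
alert = ForDurationAlert(threshold=0.01, for_evals=3)
observed = [0.02, 0.03, 0.0, 0.02, 0.02, 0.02]
print([alert.evaluate(v) for v in observed])
```

The two-evaluation spike at the start never fires; only the sustained breach at the end does. Lengthening the window trades detection speed for fewer false pages.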

Traces: where did the time go?

In a single-service app, the slow part is usually obvious. In a microservice setup, “slow API” could be any of 12 services, 3 caches, 2 databases, or the network in between.

Distributed tracing assigns each request a trace ID that’s propagated across all services it touches. Each span (a unit of work) records start time, duration, and parent span. The trace UI shows you a flame graph of the whole request.

[ ────────────── /api/orders (180ms) ──────────────────────── ]
   [ auth (12ms) ]
                   [ ── db: load order (90ms) ── ]
                                                  [ stripe (60ms) ]
                                                                    [ db: write log (5ms) ]

Now “this endpoint is slow” becomes “the order lookup takes half the request and the Stripe call another third.”
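A trace is just a list of span records linked by parent IDs, and the breakdown a trace UI shows can be derived from them. The sketch below models the request in the diagram above. The `Span` type and `breakdown` function are hypothetical, the durations come from the diagram, and the start offsets are made up because the diagram does not give them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    duration_ms: float


def breakdown(spans: list, root_id: str) -> list:
    """Return (name, share of root duration) for the root's direct children, largest first."""
    root = next(s for s in spans if s.span_id == root_id)
    children = [s for s in spans if s.parent_id == root_id]
    children.sort(key=lambda s: s.duration_ms, reverse=True)
    return [(s.name, round(s.duration_ms / root.duration_ms, 2)) for s in children]


SPANS = [
    Span("a", None, "/api/orders", 0, 180),
    Span("b", "a", "auth", 0, 12),
    Span("c", "a", "db: load order", 12, 90),
    Span("d", "a", "stripe", 102, 60),
    Span("e", "a", "db: write log", 162, 5),
]

for name, share in breakdown(SPANS, "a"):
    print(f"{name}: {share:.0%}")
```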

OpenTelemetry: the standard

OpenTelemetry (OTel) is the vendor-neutral standard for traces (and metrics, and logs). Most major languages have an SDK, and most tracing backends accept its OTLP protocol directly.

In Python:

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

Run with auto-instrumentation:

opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --service_name my-api \
  uvicorn app.main:app

It auto-instruments common libraries (FastAPI, requests, httpx, SQLAlchemy, psycopg, redis, etc.). Send to Jaeger, Tempo, Honeycomb, Datadog APM — the OTLP wire protocol works with all of them.

Manual instrumentation when you need it:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def expensive(data):
    # do_work() stands in for your own function.
    with tracer.start_as_current_span("compute_thing") as span:
        span.set_attribute("input.size", len(data))
        result = do_work(data)
        span.set_attribute("output.size", len(result))
        return result

Sampling

Tracing every request at scale is expensive and noisy. Sample:

Head-based sampling — decide at the request entry point (e.g. 1% of all requests).
Tail-based sampling — collect all traces, decide whether to keep them after they finish (keep all errors, slow requests, sample fast ones).

Tail-based is better quality but harder operationally. Head-based at 1-10% is fine for most teams.
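The two strategies differ mainly in what information is available when the decision is made. Here is a rough sketch of both. It is not how the OpenTelemetry SDK or any collector implements sampling, and the function names and default thresholds are assumptions for illustration.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before any work happens.

    Hashing the ID makes the decision deterministic, so every service that
    sees the same trace ID reaches the same answer.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def tail_keep(
    trace_id: str,
    had_error: bool,
    duration_s: float,
    slow_threshold_s: float = 1.0,
    baseline_rate: float = 0.01,
) -> bool:
    """Tail-based: decide after the trace finished, when its outcome is known."""
    if had_error or duration_s >= slow_threshold_s:
        return True  # always keep the interesting traces
    return head_sample(trace_id, baseline_rate)  # sample the boring ones


kept = sum(head_sample(str(i), 0.10) for i in range(10_000))
print(f"head-sampled {kept} of 10000 traces")
```

The operational cost of tail-based sampling comes from that second function's inputs: something has to buffer every span of every trace until the request completes.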

The tooling stack you probably want

For a typical small-to-medium backend in 2026:

Logs: Loki + Grafana, or a hosted alternative.
Metrics: Prometheus + Grafana, or a hosted Prometheus.
Traces: Tempo + Grafana, or Jaeger.
Errors: Sentry. (Yes, even with logs and traces — Sentry’s automatic grouping is unmatched.)
Uptime: Better Stack, UptimeRobot, Pingdom — external pings.

The “Grafana stack” (Loki, Prometheus, Tempo, Mimir, all visualized in Grafana) is the strongest open-source story. Hosted options (Grafana Cloud, Datadog, New Relic, Honeycomb) trade money for less ops.

For a small team, start with Sentry + a hosted log service + a hosted metrics service. The unit cost is low; the time saved is enormous. Self-host when scale demands it.

Health checks: the entry point to all of this

Your service should expose:

/healthz — fast liveness. Returns 200 if the process is up.
/readyz — readiness. Returns 200 only if the service can actually serve traffic (DB reachable, caches warm).
/metrics — Prometheus scrape endpoint.

Load balancers and orchestrators (Kubernetes, Nomad) use the first two; Prometheus uses the third.
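The difference between the first two endpoints is easiest to see in code. This is a framework-free sketch, assuming each dependency check is a zero-argument callable that returns a boolean. The function names and the response shape are illustrative, and the returned tuples would be wired into whatever web framework you use.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """Process is up and can answer. Deliberately checks nothing else."""
    return 200, {"status": "ok"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Ready only if every dependency check passes; a raising check counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    body = {"status": "ok" if status == 200 else "unavailable", "checks": results}
    return status, body


# Stand-in checks; real ones would ping the database or cache.
print(readiness({"db": lambda: True, "cache": lambda: False}))
```

Keeping dependency checks out of liveness is a deliberate choice: if the database is down, restarting the process will not fix it, but taking the instance out of rotation via readiness might help.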

A “minimum viable observability” checklist

For any service going to production:

Structured logs (JSON) at INFO level by default.
Request ID middleware that attaches an ID to every log line in a request.
/healthz and /readyz endpoints.
/metrics endpoint with at least RED metrics for HTTP and queue depth for any background workers.
Alerts for: 5xx rate, p99 latency, DB connection saturation, disk space.
Sentry (or equivalent) for unhandled exceptions, with release tagging tied to your CI version.
Uptime monitor pinging /healthz from outside your network.

That’s the minimum. Most teams stop here for a long time and that’s okay.

Conclusion

Observability is the difference between debugging in production with confidence and guessing in a panic. Three pillars, complementary not redundant: logs for the story, metrics for the dashboard, traces for the call graph. Get the basics in place from day one — it’s far easier than retrofitting after an incident.

For more on the platform layer, see Kubernetes for App Developers. For the deploy pipeline, see GitHub Actions CI/CD for Python Apps.

Happy observing!

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
