Observability bills surprise teams. The price model is “per metric series, per log byte, per trace span” — and engineers add labels generously. By month 6 the bill is unrecognizable. This post is the working playbook.

The cost shape

For Datadog, New Relic, and similar SaaS:

  • Logs: $/GB ingested + $/GB stored.
  • Metrics: $/100 series; cardinality multiplies fast.
  • Traces: $/span retained.
  • APM: $/host or $/container.

For self-hosted Loki / Prometheus / Tempo: storage + compute + your team’s time.

Cardinality

metric: http_request_duration{path="/api/users", status="200", method="GET"}

Each unique combination of label values is one series. 100 paths × 5 statuses × 4 methods = 2,000 series. Manageable.

But:

http_request_duration{path="/api/users/123", user_id="12345", session="abc"}

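User IDs in labels mean millions of series for every metric that carries them, and bills explode. The raw path has the same problem: /api/users/123 is a new label value per user, so label the route template (/api/users/{id}) instead.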

Rule: labels must be low-cardinality enums. Per-user / per-request data goes to traces / logs, not metrics.
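To see how fast the multiplication gets out of hand, here is a rough sketch. The label counts come from the example above; series_count is a hypothetical helper and the one million users figure is an assumption for illustration. The result is an upper bound, since only combinations that actually occur become series.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = series_count({"path": 100, "status": 5, "method": 4})
# Same metric with a user_id label added (1,000,000 users is an assumed figure)
unbounded = series_count(
    {"path": 100, "status": 5, "method": 4, "user_id": 1_000_000}
)

print(bounded)    # 2000
print(unbounded)  # 2000000000
```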

Cardinality audit

topk(20, count by (__name__)({__name__=~".+"}))
# top metrics by series count

Run this in Prometheus or Mimir to find the metrics with the most series. The usual culprit is a forgotten label. The query touches every series, so it can be slow on large installations.

For Datadog: their “metrics summary” page shows series count per metric. Same exercise.
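The same count can be approximated for a single service from a scrape of its /metrics endpoint, before anything reaches the backend. A minimal sketch, assuming plain Prometheus text exposition format; series_per_metric and the sample text are made up for illustration.

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition format."""
    counts: Counter = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (no labels)
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


sample = """\
# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{path="/api/users",status="200"} 10
http_requests_total{path="/api/users",status="500"} 1
http_requests_total{path="/api/orders",status="200"} 7
up 1
"""

# Top 20 metric names by series count, mirroring the topk(20, ...) query
for name, n in series_per_metric(sample).most_common(20):
    print(name, n)
```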

Log volume

The other budget killer.

log.debug("user data: %s", user.dict())  # in prod, in a hot path

Multiplied by 10k req/sec → terabytes per day.
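The arithmetic is worth doing once. A sketch, assuming 10 KB per log line (taken as 10,000 bytes; the real size depends on what user.dict() contains) and decimal terabytes; log_volume_tb_per_day is a hypothetical helper.

```python
def log_volume_tb_per_day(bytes_per_line: int, lines_per_second: int) -> float:
    """Daily log volume in decimal terabytes (1 TB = 10**12 bytes)."""
    seconds_per_day = 86_400
    return bytes_per_line * lines_per_second * seconds_per_day / 1e12


# 10 KB lines at 10k req/sec (the line size is an assumption)
print(log_volume_tb_per_day(10_000, 10_000))  # 8.64
# Even 1 KB lines at the same rate are close to a terabyte per day
print(log_volume_tb_per_day(1_000, 10_000))   # 0.864
```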

Mitigations:

  • DEBUG off in prod.
  • Sample noisy events: if random.random() < 0.01:
  • Aggregate instead of logging each occurrence: counter metric.
  • Don’t log full payloads. Hash, truncate, or redact.
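Two of these mitigations can be sketched with the standard logging module. SamplingFilter and summarize_payload are hypothetical names, the 1% rate is the one from the bullet above, and the 64-character cutoff is an assumption. Note that a truncated prefix can still contain sensitive fields, so redact first if that matters.

```python
import hashlib
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Pass every WARNING+ record; pass lower levels with probability `rate`."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return self.rng.random() < self.rate


def summarize_payload(payload: str, max_chars: int = 64) -> str:
    """Replace a full payload with its length, a short hash, and a truncated prefix."""
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"len={len(payload)} sha256={digest} head={payload[:max_chars]!r}"


log = logging.getLogger("app.hot_path")
log.addFilter(SamplingFilter(rate=0.01))  # keep roughly 1% of DEBUG/INFO records
log.info("user data: %s", summarize_payload('{"name": "example", "notes": "..."}'))
```

Injecting the random generator keeps the filter testable; in production the default is fine.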

See Python Logging 2026.

Log retention tiers

Hot:    last 7 days, fully searchable, $$$
Warm:   days 8-30, slower search, $$
Cold:   day 31 onward, archive only, $

Most SaaS vendors charge the most for the hot tier; retention beyond it is cheaper. Don’t keep 90 days of debug logs hot.

Self-hosted Loki stores its data in object storage, so cold retention is cheap.
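A small sketch of what tiering means for where the bytes sit. The tier boundaries follow the table above; the 100 GB/day volume and 90-day retention are assumptions, and both function names are hypothetical.

```python
def retention_tier(age_days: int) -> str:
    """Map a log's age in days to a tier: hot up to 7, warm up to 30, cold after."""
    if age_days <= 7:
        return "hot"
    if age_days <= 30:
        return "warm"
    return "cold"


def steady_state_gb(daily_gb: float, retention_days: int) -> dict[str, float]:
    """GB resident in each tier once ingestion has run for `retention_days`."""
    totals = {"hot": 0.0, "warm": 0.0, "cold": 0.0}
    for age in range(1, retention_days + 1):
        totals[retention_tier(age)] += daily_gb
    return totals


print(steady_state_gb(daily_gb=100, retention_days=90))
# {'hot': 700.0, 'warm': 2300.0, 'cold': 6000.0}
```

With 90-day retention, two thirds of the stored bytes are cold, which is why the price of that tier matters more than it first appears.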

Trace sampling

Tail-sampling at the OTEL Collector:

processors:
  tail_sampling:
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: random_5pct, type: probabilistic, probabilistic: { sampling_percentage: 5 } }

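Keep 100% of error and slow traces; sample 5% of the rest. A trace is kept if any policy matches, so storage drops by close to 95% only when errors and slow requests are rare. The traces you actually debug with are preserved.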
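The decision logic behind those three policies can be modeled in a few lines. This is a simplified sketch, not the Collector's implementation: Trace, keep_trace, and expected_kept_fraction are hypothetical, the exact boundary handling at the latency threshold is an assumption, and the 2% error-or-slow share is made up.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    has_error: bool
    duration_ms: float


def keep_trace(trace: Trace, rng: random.Random,
               baseline: float = 0.05, slow_ms: float = 1000.0) -> bool:
    """Keep a trace if any policy matches: error, slow, or the random baseline."""
    if trace.has_error:
        return True
    if trace.duration_ms >= slow_ms:
        return True
    return rng.random() < baseline


def expected_kept_fraction(error_or_slow_fraction: float,
                           baseline: float = 0.05) -> float:
    """Share of traces retained, given the share that is an error or slow."""
    return error_or_slow_fraction + (1 - error_or_slow_fraction) * baseline


rng = random.Random(0)
print(keep_trace(Trace(has_error=True, duration_ms=12.0), rng))  # True
# With 2% of traces erroring or slow (assumed), about 6.9% are retained
print(round(expected_kept_fraction(0.02), 4))  # 0.069
```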

See Distributed Tracing 2026.

Metric aggregation

Don’t store raw points; pre-aggregate.

# Recording rule
- record: http_request_duration_p99_5m
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

The quantile is pre-computed, so dashboards read a small recorded series instead of recomputing it from every bucket at query time. Keep the raw histograms only short-term.
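It helps to remember that a histogram quantile is an estimate interpolated from bucket counts, not an exact percentile. The sketch below is a simplified re-implementation of that idea for illustration, not Prometheus's own code; the bucket bounds and counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative counts per upper bound in seconds (made-up data)
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))  # 1.0
print(histogram_quantile(0.5, buckets))   # 0.1
```

The answer can never be finer than the bucket boundaries, so choose buckets around the latencies you alert on.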

Self-host break-even

SaaS: $100k/year for observability of a typical 30-engineer company.
Self-host: 1 SRE * $200k + infra ~$30k = $230k year 1; $230k/year ongoing if same SRE.

Self-host wins long-term IF observability is part of someone’s job and that infra is shared with other concerns. Otherwise SaaS is cheaper.

For <50 engineers: SaaS. Beyond: evaluate.
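The break-even is easier to reason about as a fraction of one SRE's time. A sketch using the figures above ($200k salary, $30k infra, $100k SaaS bill); the function names are hypothetical and the model ignores migration cost and on-call load.

```python
def self_host_cost(sre_fraction: float, sre_salary: float = 200_000,
                   infra: float = 30_000) -> float:
    """Yearly self-host cost when observability takes `sre_fraction` of one SRE."""
    return sre_fraction * sre_salary + infra


def break_even_sre_fraction(saas_bill: float, sre_salary: float = 200_000,
                            infra: float = 30_000) -> float:
    """SRE fraction at which self-hosting costs the same as the SaaS bill."""
    return (saas_bill - infra) / sre_salary


print(self_host_cost(1.0))               # 230000.0
print(break_even_sre_fraction(100_000))  # 0.35
```

Under these numbers, self-hosting beats a $100k SaaS bill only if it takes about a third of one person's time or less, which is the "part of someone's job" condition above.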

Open-source stacks

  • Prometheus + Mimir/Thanos — metrics.
  • Loki — logs.
  • Tempo — traces.
  • Grafana — viz.
  • OTEL Collector — pipelines.

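Mimir, Loki, Tempo, and Grafana are Grafana Labs projects; Prometheus, Thanos, and the OTEL Collector are CNCF projects that plug into them. The stack is coherent, and you can self-host it or use Grafana Cloud for a hybrid setup.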

Alternatives: VictoriaMetrics (metrics), SigNoz, Uptrace.

Common waste patterns

1. user_id in metric labels

Multiplies series by user count. Move to logs / traces.

2. trace_id / request_id in labels

Same problem. Each one is unique.

3. Logging full request bodies

A JSON-serialized 10 KB request becomes a 10 KB log line. At 10k req/sec that is about 100 MB/s, or roughly 8.6 TB per day.

4. No log levels

Everything is INFO. DEBUG should be off in prod.

5. Free-form log messages

log.info(f"user {user_id} did {action} on {object}") — millions of unique strings, hard to query, expensive to index.
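One way out of pattern 5 is a constant event name with the variable parts as structured fields, so the message stays groupable and the fields stay queryable. A minimal sketch with the standard library: JsonFormatter and the "fields" key in extra are conventions invented for this example, not a logging API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render a record as JSON: constant event name plus structured fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        # `fields` is attached through `extra=` below; absent means no extra fields
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


log = logging.getLogger("app.audit")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log.addHandler(handler)
log.setLevel(logging.INFO)

# The message never varies; user, action, and object travel as fields
log.info(
    "user_action",
    extra={"fields": {"user_id": 42, "action": "delete", "object": "invoice"}},
)
```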

Audit framework

Quarterly:

  1. Top 20 metrics by cardinality — anything weird?
  2. Top log producers — anyone spamming?
  3. Trace volume — sample rate appropriate?
  4. Retention — are we over-retaining?
  5. Unused dashboards / alerts — clean up.
  6. Cost vs business growth — is the slope reasonable?

Tools

  • Datadog Cost Explorer — built-in.
  • Cribl — observability data pipeline; reduce volume in-flight.
  • Logstash / Fluent Bit — log filtering before ingest.
  • OTEL Collector — sample / drop / aggregate at the edge.

The pattern: process before sending. Drop debug; sample non-essential; aggregate where possible.

High-cardinality friendly platforms

For genuinely high-cardinality use cases (per-user, per-request analytics):

  • Honeycomb — designed for this; pay-per-event not pay-per-series.
  • Datadog — supports it but pricing punishes it.
  • ClickHouse — DIY; very high cardinality cheap.

Pick a platform that matches your cardinality reality.

Common mistakes

1. Tag everything with user_id

Metrics blow up. Logs/traces are the right place.

2. SaaS with no spend monitoring

You only know the bill when it arrives. Set spend alerts.

3. Log debug in prod

Volume → cost. Keep debug for dev/staging only.

4. Same retention for everything

Errors: keep long. Successes: keep short. Tier accordingly.

5. Add observability without removing

Old metrics are never deleted while new ones are added every quarter, so series count climbs forever. Schedule a quarterly cleanup.
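For mistake 2, a spend alert does not need to be clever. A sketch of a linear month-end projection, assuming you can pull month-to-date spend from your vendor's usage page or API; the function names, the $4,000 figure, and the $10,000 budget are all hypothetical.

```python
import calendar
from datetime import date


def projected_month_spend(spend_to_date: float, today: date) -> float:
    """Linear projection of month-end spend from month-to-date spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Treats today as a full day of spend, which slightly underestimates
    return spend_to_date / today.day * days_in_month


def should_alert(spend_to_date: float, today: date, monthly_budget: float) -> bool:
    """True when the projection exceeds the budget."""
    return projected_month_spend(spend_to_date, today) > monthly_budget


print(projected_month_spend(4_000, date(2026, 3, 10)))  # 12400.0
print(should_alert(4_000, date(2026, 3, 10), monthly_budget=10_000))  # True
```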

What I’d ship today

For a typical SaaS:

  • OTEL Collector as ingest pipeline (sample / aggregate / drop).
  • Datadog or Honeycomb for SaaS, with cardinality alerts.
  • Self-host Loki + Tempo + Mimir once bills pass $100k/year and someone already owns the infrastructure.
  • Quarterly cardinality audit.
  • Log levels enforced: WARN/ERROR in prod by default.
  • Tail-sampling for traces.
  • Retention tiers for logs.
  • Spend alerts in cloud + observability platforms.

Read this next

If you want my OTEL Collector configs (sampling + cardinality limits), they’re at rajpoot.dev.

