Why is my observability bill so high?

Almost always cardinality (per-user / per-request labels) or volume (debug logs in prod). Audit cardinality first. SaaS observability bills can match infra bills for AI/SaaS apps.

Observability Cost Control in 2026 — Cardinality, Sampling, and the Bills That Surprise You

Q: Self-host or SaaS?

SaaS for <50 engineers. Self-host (Prometheus + Loki + Tempo + Grafana) when bills exceed ~$100k/year. The break-even includes engineer time, not just licenses.

Observability bills surprise teams. The price model is “per metric series, per log byte, per trace span” — and engineers add labels generously. By month 6 the bill is unrecognizable. This post is the working playbook.

The cost shape

For Datadog / New Relic / similar:

Logs: $/GB ingested + $/GB stored.
Metrics: $/100 series; cardinality multiplies fast.
Traces: $/span retained.
APM: $/host or $/container.

For self-hosted Loki / Prometheus / Tempo: storage + compute + your team’s time.

Cardinality

metric: http_request_duration{path="/api/users", status="200", method="GET"}

Each unique combination of label values is one series. 100 paths × 5 statuses × 4 methods = 2,000 series, which is manageable.

But:

http_request_duration{path="/api/users/123", user_id="12345", session="abc"}

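User IDs in labels mean millions of series for every metric that carries them, and each new user or session adds more. The raw path /api/users/123 has the same problem: every ID becomes a new label value. Bills explode.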

Rule: labels must be low-cardinality enums. Per-user / per-request data goes to traces / logs, not metrics.
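To make the multiplication concrete, here is a minimal sketch. `series_count` and `normalize_path` are hypothetical helper names, not part of any client library, and the `:id` placeholder is an assumption about how you might template routes before using them as a label.

```python
import math
import re


def series_count(label_cardinalities: dict) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical rule: treat purely numeric or UUID-shaped path segments as IDs.
_ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})$")


def normalize_path(path: str) -> str:
    """Collapse ID-like segments so the path label stays a small enum."""
    return "/".join(
        ":id" if _ID_SEGMENT.match(segment) else segment
        for segment in path.split("/")
    )


low = {"path": 100, "status": 5, "method": 4}
high = {**low, "user_id": 1_000_000}

print(series_count(low))    # the manageable case from the text
print(series_count(high))   # the same metric once user_id is a label
print(normalize_path("/api/users/123"))
```

The product is an upper bound; real series counts are lower when not every combination occurs, but the growth rate is what matters.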

Cardinality audit

topk(20, count by (__name__)({__name__=~".+"}))
# top metrics by series count

In Prometheus / Mimir. Find the metrics with the most cardinality. Often: a forgotten label.

For Datadog: the Metrics Summary page shows the series count per metric. Same exercise.
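If you cannot query the backend, the same audit can be done per service against its /metrics output. A rough sketch, assuming the Prometheus text exposition format; `series_per_metric` is a hypothetical helper and `SAMPLE` is made-up data for illustration.

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at '{' (labels present) or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


SAMPLE = """\
# HELP http_requests_total Total requests
# TYPE http_requests_total counter
http_requests_total{path="/a",status="200"} 10
http_requests_total{path="/b",status="200"} 4
http_requests_total{path="/b",status="500"} 1
process_cpu_seconds_total 12.5
"""

for name, count in series_per_metric(SAMPLE).most_common(20):
    print(name, count)
```

Histogram `_bucket`, `_sum`, and `_count` lines are counted as separate names here, which matches what `count by (__name__)` reports.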

Log volume

The other budget killer.

log.debug("user data: %s", user.dict())  # in prod, in a hot path

At 10k req/sec, even a 1 KB line adds up to about 860 GB/day; larger payloads push this into terabytes per day.

Mitigations:

DEBUG off in prod.
Sample noisy events, e.g. log only when random.random() < 0.01.
Aggregate instead of logging each occurrence: counter metric.
Don’t log full payloads. Hash, truncate, or redact.
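Two of these mitigations can be sketched with the standard `logging` module. The class name `SampleBelowWarning`, the helper `truncate`, the logger name, and the 200-character limit are all assumptions for illustration.

```python
import logging
import random
from typing import Optional


class SampleBelowWarning(logging.Filter):
    """Pass every WARNING+ record; pass lower levels with probability `rate`."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.rate


def truncate(value: object, limit: int = 200) -> str:
    """Cap the logged representation instead of dumping a full payload."""
    text = str(value)
    return text if len(text) <= limit else text[:limit] + "...[truncated]"


logger = logging.getLogger("checkout")  # hypothetical logger name
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.addFilter(SampleBelowWarning(rate=0.01))  # keep ~1% of INFO records
logger.addHandler(handler)

logger.info("cart updated")  # usually dropped by the sampler
logger.error("payment failed: %s", truncate("x" * 5000))  # always kept, capped
```

Attaching the filter to the handler rather than the logger means other handlers (say, a local debug file) can still see everything.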

See Python Logging 2026.

Log retention tiers

Hot:    last 7 days, fully searchable, $$$
Warm:   8-30 days, slower search, $$
Cold:   31+ days, archive only, $

Most SaaS vendors charge the most for the hot (indexed, searchable) tier; keeping data longer in warm or archive tiers costs much less. Don’t keep 90 days of debug logs hot.

For self-host Loki: object storage backend; cold retention is cheap.
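To see how much data sits in each tier at steady state, a small sketch helps. It uses the 7-day and 30-day boundaries from the tiers above; treating day 30 as warm, and the function names themselves, are assumptions. No prices are included because they vary by vendor.

```python
def tier_for_age(age_days: int) -> str:
    """Map a log's age to the tier it should live in."""
    if age_days <= 7:
        return "hot"
    if age_days <= 30:
        return "warm"
    return "cold"


def stored_gb_by_tier(daily_ingest_gb: float, retention_days: int) -> dict:
    """Steady-state volume held in each tier for a given total retention."""
    hot_days = min(retention_days, 7)
    warm_days = max(0, min(retention_days, 30) - 7)
    cold_days = max(0, retention_days - 30)
    return {
        "hot": hot_days * daily_ingest_gb,
        "warm": warm_days * daily_ingest_gb,
        "cold": cold_days * daily_ingest_gb,
    }


# 100 GB/day kept for 90 days: most of the bytes belong in the cheap tier.
print(stored_gb_by_tier(daily_ingest_gb=100, retention_days=90))
```

Multiply each tier's volume by your vendor's per-GB price to compare a tiered policy against keeping everything hot.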

Trace sampling

Tail-sampling at the OTEL Collector:

processors:
  tail_sampling:
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: random_5pct, type: probabilistic, probabilistic: { sampling_percentage: 5 } }

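Keep 100% of error and slow traces; sample 5% of the rest. If errors and slow requests are rare, stored trace volume drops by roughly 95% while the traces you actually debug with are preserved.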
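The decision those three policies make can be sketched in a few lines. This is an illustration of the logic, not the Collector's implementation: `TraceSummary` is a hypothetical shape, and whether the latency boundary is inclusive is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical per-trace summary available once the trace is complete."""
    has_error: bool
    duration_ms: float


def keep_trace(trace: TraceSummary, rng: random.Random,
               threshold_ms: float = 1000, sample_rate: float = 0.05) -> bool:
    """Keep all errors and slow traces; keep the rest with probability sample_rate."""
    if trace.has_error:
        return True
    if trace.duration_ms >= threshold_ms:
        return True
    return rng.random() < sample_rate


def expected_kept_fraction(interesting_fraction: float,
                           sample_rate: float = 0.05) -> float:
    """Share of traces stored when `interesting_fraction` are errors or slow."""
    return interesting_fraction + (1 - interesting_fraction) * sample_rate


# With 2% of traces being errors or slow, a bit under 7% of traces are stored.
print(f"{expected_kept_fraction(0.02):.1%}")
```

The second function is the useful one for budgeting: the savings depend on how rare errors and slow requests really are in your traffic.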

See Distributed Tracing 2026.

Metric aggregation

Don’t store raw points; pre-aggregate.

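# Recording rule (lives under groups[].rules in a Prometheus rules file)
# sum by (le) collapses all other labels, so the result is one p99 series
- record: http_request_duration_p99_5m
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))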

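The quantile is pre-computed, so dashboards read a ready-made series instead of re-aggregating every bucket on each query. Storage only shrinks if the raw histogram buckets are then kept short-term or dropped.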
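For intuition about what the rule pre-computes, here is a simplified sketch of a bucket-based quantile estimate using linear interpolation inside the matching bucket. It is a teaching version, not Prometheus's code, and the bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls past the last finite bucket: report that bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up latency histogram in seconds: 100 observations in total.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, BUCKETS))
print(histogram_quantile(0.99, BUCKETS))
```

The estimate is only as precise as the bucket boundaries, which is a reason to keep the bucket layout deliberate rather than adding labels.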

Self-host break-even

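SaaS: ~$100k/year for observability at a typical 30-engineer company.
Self-host: 1 dedicated SRE × $200k + ~$30k infra = $230k in year 1, and about $230k/year ongoing if the same SRE stays on it full-time.

Self-host wins long-term only IF observability is a fraction of someone’s job and the infra is shared with other concerns. With the numbers above, it beats a $100k SaaS bill only when it takes less than about 35% of that SRE’s time ((100k − 30k) / 200k). Otherwise SaaS is cheaper.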

For <50 engineers: SaaS. Beyond: evaluate.
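The break-even is easy to rerun with your own numbers. A minimal sketch; the defaults are the $200k SRE cost and $30k infra figures from this section, and the function names are hypothetical.

```python
def self_host_annual_cost(sre_fraction: float,
                          sre_cost: float = 200_000,
                          infra_cost: float = 30_000) -> float:
    """Yearly self-host cost when observability takes `sre_fraction` of one SRE."""
    return sre_fraction * sre_cost + infra_cost


def break_even_sre_fraction(saas_bill: float,
                            sre_cost: float = 200_000,
                            infra_cost: float = 30_000) -> float:
    """Largest share of an SRE's time at which self-hosting still matches SaaS."""
    return (saas_bill - infra_cost) / sre_cost


print(self_host_annual_cost(1.0))          # full-time SRE
print(break_even_sre_fraction(100_000))    # against a $100k SaaS bill
print(break_even_sre_fraction(230_000))    # bill at which a full-time SRE pays off
```

This ignores migration cost and on-call load, so treat the result as an optimistic bound for self-hosting.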

Open-source stacks

Prometheus + Mimir/Thanos — metrics.
Loki — logs.
Tempo — traces.
Grafana — viz.
OTEL Collector — pipelines.

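Mimir, Loki, Tempo, and Grafana are Grafana Labs projects; Prometheus, Thanos, and the OTEL Collector are CNCF projects that plug into the same stack. The pieces fit together coherently. Can self-host or use Grafana Cloud for hybrid.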

Alternatives: VictoriaMetrics (metrics), SigNoz, Uptrace.

Common waste patterns

1. user_id in metric labels

Multiplies series by user count. Move to logs / traces.

2. trace_id / request_id in labels

Same problem. Each one is unique.

3. Logging full request bodies

A JSON-serialized 10KB request becomes a 10KB log line; at 10k req/sec that is about 100 MB/s, or roughly 8.6 TB/day.

4. No log levels

Everything is INFO. DEBUG should be off in prod.

5. Free-form log messages

log.info(f"user {user_id} did {action} on {object}") — millions of unique strings, hard to query, expensive to index.
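One way to fix pattern 5 is a constant event name with the variable parts as structured fields. A sketch using only the standard library; `JsonFormatter`, the logger name, and the field names `user_id`, `action`, and `target` are assumptions for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit the constant message plus selected fields as one JSON object."""

    FIELDS = ("user_id", "action", "target")  # hypothetical field names

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


log = logging.getLogger("audit")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log.addHandler(handler)
log.setLevel(logging.INFO)

# The message is always the same string; the variable data rides along in `extra`.
log.info("user_action", extra={"user_id": 12345, "action": "delete", "target": "invoice"})
```

The backend can now index one event name and filter on fields, instead of tokenizing millions of distinct sentences.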

Audit framework

Quarterly:

Top 20 metrics by cardinality — anything weird?
Top log producers — anyone spamming?
Trace volume — sample rate appropriate?
Retention — are we over-retaining?
Unused dashboards / alerts — clean up.
Cost vs business growth — is the slope reasonable?

Tools

Datadog Cost Explorer — built-in.
Cribl — observability data pipeline; reduce volume in-flight.
Logstash / Fluent Bit — log filtering before ingest.
OTEL Collector — sample / drop / aggregate at the edge.

The pattern: process before sending. Drop debug; sample non-essential; aggregate where possible.

High-cardinality friendly platforms

For genuinely high-cardinality use cases (per-user, per-request analytics):

Honeycomb — designed for this; pay-per-event not pay-per-series.
Datadog — supports it but pricing punishes it.
ClickHouse — DIY; very high cardinality cheap.

Pick a platform that matches your cardinality reality.

Common mistakes

1. Tag everything with user_id

Metrics blow up. Logs/traces are the right place.

2. SaaS with no monitoring

You only know the bill when it arrives. Set spend alerts.

3. Log debug in prod

Volume → cost. Keep debug for dev/staging only.

4. Same retention for everything

Errors: keep long. Successes: keep short. Tier accordingly.

5. Add observability without removing

Old metrics never deleted; new ones added every quarter. Series count climbs forever. Quarterly cleanup.
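For mistake 2, a spend alert can be as simple as projecting month-to-date spend to month end. A rough sketch assuming a linear run rate; how you fetch the month-to-date figure depends on your vendor, so it is passed in as a plain number, and the function names are hypothetical.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def over_budget(spend_to_date: float, today: date, monthly_budget: float) -> bool:
    """True when the current run rate would exceed the monthly budget."""
    return projected_month_end_spend(spend_to_date, today) > monthly_budget


# $4,000 spent by April 10 projects to $12,000 for the month.
print(projected_month_end_spend(4000, date(2026, 4, 10)))
print(over_budget(4000, date(2026, 4, 10), monthly_budget=10_000))
```

Run it daily; a projection that crosses the budget in week two leaves time to find the offending metric or log stream.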

What I’d ship today

For a typical SaaS:

OTEL Collector as ingest pipeline (sample / aggregate / drop).
Datadog or Honeycomb for SaaS, with cardinality alerts.
Self-host Loki + Tempo + Mimir when bills > $100k/year.
Quarterly cardinality audit.
Log levels enforced: WARN/ERROR in prod by default.
Tail-sampling for traces.
Retention tiers for logs.
Spend alerts in cloud + observability platforms.

Read this next

If you want my OTEL Collector configs (sampling + cardinality limits), they’re at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.
