Observability bills surprise teams. The price model is “per metric series, per log byte, per trace span” — and engineers add labels generously. By month 6 the bill is unrecognizable. This post is the working playbook.
The cost shape
For Datadog / New Relic / similar:
- Logs: $/GB ingested + $/GB stored.
- Metrics: $/100 series; cardinality multiplies fast.
- Traces: $/span retained.
- APM: $/host or $/container.
For self-hosted Loki / Prometheus / Tempo: storage + compute + your team’s time.
Cardinality
metric: http_request_duration{path="/api/users", status="200", method="GET"}
Each unique combination of label values = one series. With 100 paths × 5 statuses × 4 methods = 2000 series. Manageable.
But:
http_request_duration{path="/api/users/123", user_id="12345", session="abc"}
User IDs in labels = one series per user, per metric. With millions of users that is millions of series. Bills explode.
Rule: labels must be low-cardinality enums. Per-user / per-request data goes to traces / logs, not metrics.
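A minimal sketch of that rule, assuming two hypothetical helpers: normalize_path collapses ID-like path segments into a placeholder before the path is used as a label value, and estimate_series multiplies the distinct values per label to get an upper bound on series. The regex and the ":id" placeholder are illustrative choices, not taken from any library.

```python
import math
import re

# Segments that look like numeric IDs or UUIDs (illustrative pattern; adjust to your routes).
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with ':id' so the label stays a small enum."""
    parts = [":id" if _ID_SEGMENT.match(part) else part for part in path.split("/")]
    return "/".join(parts)


def estimate_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_value_counts.values())


if __name__ == "__main__":
    print(normalize_path("/api/users/123"))
    print(estimate_series({"path": 100, "status": 5, "method": 4}))
    print(estimate_series({"path": 100, "status": 5, "method": 4, "user_id": 1_000_000}))
```

Adding a single user_id label with a million values turns the 2000-series estimate into two billion, which is why the ID belongs in a trace or log field instead.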
Cardinality audit
topk(20, count by (__name__)({__name__=~".+"}))
# top metrics by series count
Run this in Prometheus or Mimir to find the metrics with the most series. The culprit is often a forgotten label.
For Datadog: their “metrics summary” page shows series count per metric. Same exercise.
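To turn the audit into a recurring check, one option is to post-process the query result. The sketch below assumes you have already fetched the JSON body of an instant query from the Prometheus HTTP API (/api/v1/query) and parsed it into a dict; series_over_budget and the budget value are hypothetical names for illustration.

```python
def series_over_budget(response: dict, budget: int) -> list[tuple[str, int]]:
    """Return (metric_name, series_count) pairs above `budget`, largest first.

    `response` is the parsed JSON body of an instant query such as
    count by (__name__)({__name__=~".+"}).
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    rows = []
    for sample in response["data"]["result"]:
        name = sample["metric"].get("__name__", "<unnamed>")
        count = int(float(sample["value"][1]))  # sample values arrive as strings
        if count > budget:
            rows.append((name, count))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    example = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"__name__": "http_request_duration_seconds_bucket"}, "value": [0, "48000"]},
                {"metric": {"__name__": "up"}, "value": [0, "120"]},
            ],
        },
    }
    print(series_over_budget(example, budget=10_000))
```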
Log volume
The other budget killer.
log.debug("user data: %s", user.dict()) # in prod, in a hot path
Multiplied by 10k req/sec → terabytes per day.
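The arithmetic behind that claim, as a quick sketch. The 2 KB line size is an assumption for illustration; plug in your own average.

```python
SECONDS_PER_DAY = 86_400


def daily_log_bytes(requests_per_sec: float, bytes_per_line: float,
                    lines_per_request: float = 1.0) -> float:
    """Bytes of log output per day for a steady request rate."""
    return requests_per_sec * lines_per_request * bytes_per_line * SECONDS_PER_DAY


def to_terabytes(n_bytes: float) -> float:
    """Convert bytes to decimal terabytes (1 TB = 1e12 bytes)."""
    return n_bytes / 1e12


if __name__ == "__main__":
    # One assumed 2 KB debug line per request at 10k req/sec.
    print(f"{to_terabytes(daily_log_bytes(10_000, 2_000)):.3f} TB/day")
```

One such line per request is already about 1.7 TB per day, before any other logging.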
Mitigations:
- DEBUG off in prod.
- Sample noisy events, e.g. guard the call with if random.random() < 0.01: to keep about 1 in 100.
- Aggregate instead of logging each occurrence: increment a counter metric.
- Don’t log full payloads. Hash, truncate, or redact.
See Python Logging 2026.
Log retention tiers
Hot: last 7 days, fully searchable, $$$
Warm: 8-30 days, slower search, $$
Cold: 31+ days, archive only, $
Most SaaS vendors price the hot, searchable tier highest; retention beyond it is cheaper. Don’t keep 90 days of debug logs hot.
For self-hosted Loki: chunks sit in an object storage backend, so cold retention is cheap.
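A small sketch of how retention splits across those tiers, using the 7-day and 30-day boundaries from the list above. The function names and the 100 GB/day figure are illustrative assumptions.

```python
def retention_tier(age_days: int) -> str:
    """Map a log's age in days to the tier it should live in."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 7:
        return "hot"
    if age_days <= 30:
        return "warm"
    return "cold"


def tier_volumes_gb(daily_gb: float, retention_days: int) -> dict[str, float]:
    """Steady-state volume held in each tier for a given total retention."""
    hot_days = min(retention_days, 7)
    warm_days = max(0, min(retention_days, 30) - 7)
    cold_days = max(0, retention_days - 30)
    return {
        "hot": hot_days * daily_gb,
        "warm": warm_days * daily_gb,
        "cold": cold_days * daily_gb,
    }


if __name__ == "__main__":
    # Assumed 100 GB/day with 90 days of total retention.
    print(tier_volumes_gb(100, 90))
```

With tiering, only 7 of the 90 days sit in the expensive tier; keeping all 90 days hot would hold roughly 13 times as much data at the top price.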
Trace sampling
Tail-sampling at the OTEL Collector:
processors:
  tail_sampling:
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: random_5pct, type: probabilistic, probabilistic: { sampling_percentage: 5 } }
Keep 100% of error and slow traces; sample 5% of the rest. Storage drops by roughly 90-95%, depending on how many traces are errors or slow; signal preserved.
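To reason about what those three policies keep, here is a sketch of the decision in plain Python. It is a model of the logic, not the Collector's implementation; Trace, keep_trace, and expected_kept_fraction are hypothetical names, and the 2% error-or-slow share in the usage example is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    has_error: bool
    duration_ms: float


def keep_trace(trace: Trace, rng: random.Random,
               threshold_ms: float = 1000.0, sample_pct: float = 5.0) -> bool:
    """Keep a trace if any policy matches: error, slow, or the random sample."""
    if trace.has_error:
        return True
    if trace.duration_ms > threshold_ms:
        return True
    return rng.random() * 100.0 < sample_pct


def expected_kept_fraction(always_kept_fraction: float, sample_pct: float = 5.0) -> float:
    """Share of traces retained when some fraction is always kept (errors + slow)."""
    return always_kept_fraction + (1.0 - always_kept_fraction) * sample_pct / 100.0


if __name__ == "__main__":
    # Assume 2% of traces are errors or slow.
    print(f"kept: {expected_kept_fraction(0.02):.1%}")
```

Under that assumption about 6.9% of traces are retained, so the saving is a little under 95% and shrinks as the error or slow share grows.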
See Distributed Tracing 2026.
Metric aggregation
Don’t store raw points; pre-aggregate.
# Recording rule
- record: http_request_duration_p99_5m
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Pre-computed, so dashboards query one small series instead of every bucket. Storage shrinks only if the raw histograms are kept short-term.
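For intuition about what the rule precomputes, here is a simplified sketch of quantile estimation from cumulative histogram buckets, using linear interpolation inside the matching bucket. It follows the general approach Prometheus documents for histogram_quantile but is not its implementation; it assumes positive bucket bounds and a lower bound of 0 for the first bucket, and the bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound if len(buckets) > 1 else math.nan
            # Linear interpolation between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

The estimate is only as precise as the bucket boundaries, which is one more reason to choose buckets deliberately before aggregating.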
Self-host break-even
SaaS: $100k/year for observability of a typical 30-engineer company.
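Self-host: 1 SRE × $200k + infra ~$30k = $230k in year 1, and $230k/year ongoing if the same SRE stays on it.
Self-host wins long-term only IF observability is already part of someone’s job and the infra is shared with other concerns, so that only a fraction of that $200k is attributable to it. With a dedicated SRE, SaaS at $100k/year is cheaper.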
For <50 engineers: SaaS. Beyond: evaluate.
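The break-even above depends mostly on how much of an SRE's time you attribute to observability. A small sketch using the post's numbers ($200k SRE, ~$30k infra, $100k SaaS); the function names are hypothetical.

```python
def self_host_annual_cost(sre_fraction: float, sre_salary: float = 200_000,
                          infra: float = 30_000) -> float:
    """Yearly self-host cost when `sre_fraction` of one SRE is spent on observability."""
    return sre_fraction * sre_salary + infra


def break_even_sre_fraction(saas_annual: float, sre_salary: float = 200_000,
                            infra: float = 30_000) -> float:
    """SRE fraction at which self-hosting costs the same as the SaaS bill."""
    return (saas_annual - infra) / sre_salary


if __name__ == "__main__":
    print(self_host_annual_cost(1.0))        # dedicated SRE
    print(break_even_sre_fraction(100_000))  # fraction where the two options tie
```

With these numbers, self-hosting beats a $100k SaaS bill only if it takes less than about a third of one SRE's time; with a full-time SRE the bill has to pass $230k first.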
Open-source stacks
- Prometheus + Mimir/Thanos — metrics.
- Loki — logs.
- Tempo — traces.
- Grafana — viz.
- OTEL Collector — pipelines.
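Mimir, Loki, Tempo, and Grafana are Grafana Labs projects; Prometheus, Thanos, and the OTEL Collector are CNCF projects that plug into them. Coherent. Can self-host or use Grafana Cloud for hybrid.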
Alternatives: VictoriaMetrics (metrics), SigNoz, Uptrace.
Common waste patterns
1. user_id in metric labels
Multiplies series by user count. Move to logs / traces.
2. trace_id / request_id in labels
Same problem. Each one is unique.
3. Logging full request bodies
JSON-serialized 10KB request → 10KB log line × 10k req/sec = 100 MB/s, roughly 8.6 TB per day.
4. No log levels
Everything is INFO. DEBUG should be off in prod.
5. Free-form log messages
log.info(f"user {user_id} did {action} on {object}") — millions of unique strings, hard to query, expensive to index.
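One way to avoid pattern 5 with the standard logging module is a constant event name plus structured fields. The sketch below is an assumption about how you might wire it: JsonFormatter, build_logger, and the field names user_id, action, and target are all hypothetical.

```python
import io
import json
import logging

# Hypothetical structured fields copied into the output when present.
FIELDS = ("user_id", "action", "target")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record: constant event name plus known fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        for name in FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("structured_example")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    # The message stays constant; the variable parts travel as fields.
    log.info("user_action", extra={"user_id": 42, "action": "delete", "target": "invoice"})
    print(buffer.getvalue(), end="")
```

The message string is now one value that indexes cheaply, and the variable parts become fields you can filter on.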
Audit framework
Quarterly:
- Top 20 metrics by cardinality — anything weird?
- Top log producers — anyone spamming?
- Trace volume — sample rate appropriate?
- Retention — are we over-retaining?
- Unused dashboards / alerts — clean up.
- Cost vs business growth — is the slope reasonable?
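For the last item, a rough way to check the slope is to compare the growth of the bill with the growth of a usage measure over the same period. The sketch assumes you pick the usage measure (requests, active customers) and the tolerance yourself; all names and figures are illustrative.

```python
def growth_rate(start: float, end: float) -> float:
    """Relative growth between two positive values, e.g. 0.2 for +20%."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (end - start) / start


def cost_outpacing_growth(cost_start: float, cost_end: float,
                          usage_start: float, usage_end: float,
                          tolerance: float = 0.10) -> bool:
    """True when cost grew faster than usage by more than `tolerance`."""
    return growth_rate(cost_start, cost_end) > growth_rate(usage_start, usage_end) + tolerance


if __name__ == "__main__":
    # Assumed quarter: bill up 50%, requests up 20%.
    print(cost_outpacing_growth(10_000, 15_000, 1_000_000, 1_200_000))
```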
Tools
- Datadog Cost Explorer — built-in.
- Cribl — observability data pipeline; reduce volume in-flight.
- Logstash / Fluent Bit — log filtering before ingest.
- OTEL Collector — sample / drop / aggregate at the edge.
The pattern: process before sending. Drop debug; sample non-essential; aggregate where possible.
High-cardinality friendly platforms
For genuinely high-cardinality use cases (per-user, per-request analytics):
- Honeycomb — designed for this; pay-per-event not pay-per-series.
- Datadog — supports it but pricing punishes it.
- ClickHouse — DIY; very high cardinality cheap.
Pick a platform that matches your cardinality reality.
Common mistakes
1. Tag everything with user_id
Metrics blow up. Logs/traces are the right place.
2. SaaS with no spend monitoring
You only know the bill when it arrives. Set spend alerts.
3. Log debug in prod
Volume → cost. Keep debug for dev/staging only.
4. Same retention for everything
Errors: keep long. Successes: keep short. Tier accordingly.
5. Add observability without removing
Old metrics never deleted; new ones added every quarter. Series count climbs forever. Quarterly cleanup.
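For mistake 2, a spend alert can be as simple as projecting month-to-date spend to month end. The sketch assumes a linear run rate, which understates a bill that is accelerating; the function names and budget figure are hypothetical.

```python
import calendar
import datetime as dt


def projected_month_end_spend(spend_to_date: float, today: dt.date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def should_alert(spend_to_date: float, today: dt.date, monthly_budget: float) -> bool:
    """True when the projected month-end spend exceeds the budget."""
    return projected_month_end_spend(spend_to_date, today) > monthly_budget


if __name__ == "__main__":
    day = dt.date(2025, 6, 10)
    print(projected_month_end_spend(4_000, day))
    print(should_alert(4_000, day, monthly_budget=10_000))
```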
What I’d ship today
For a typical SaaS:
- OTEL Collector as ingest pipeline (sample / aggregate / drop).
- Datadog or Honeycomb for SaaS, with cardinality alerts.
- Evaluate self-hosted Loki + Tempo + Mimir once bills pass $100k/year and someone already owns observability.
- Quarterly cardinality audit.
- Log levels enforced: WARN/ERROR in prod by default.
- Tail-sampling for traces.
- Retention tiers for logs.
- Spend alerts in cloud + observability platforms.
Read this next
If you want my OTEL Collector configs (sampling + cardinality limits), they’re at rajpoot.dev.