SLOs (Service Level Objectives) and error budgets are the SRE discipline that translates “we want it reliable” into engineering decisions. Done well, they end the eternal debate between “ship features” and “improve reliability.” This post is the working playbook.
SLI, SLO, SLA
- SLI: Service Level Indicator — a metric. “% of HTTP requests completing successfully under 500ms.”
- SLO: Service Level Objective — the target. “99.9% over 30 days.”
- SLA: Service Level Agreement — a contract with consequences. “99.5% or you get a refund.”
Most engineering decisions live at the SLO level. SLAs are a legal and commercial matter, and are usually set looser than the internal SLO.
Pick the right SLI
For user-facing services:
- Availability: % of requests completing without 5xx.
- Latency: % of requests under N ms.
- Quality: % of correct responses (for things like search, recommendations).
For data pipelines:
- Freshness: % of data delivered within the freshness target.
- Completeness: % of data points received vs expected.
Pick 2–4 SLIs per service. Too many SLIs = nobody pays attention to any.
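As a rough illustration of what "an SLI is a ratio of good events to total events" means in practice, here is a minimal Python sketch. The `Request` record, its field names, and the sample data are hypothetical; in a real setup these ratios would usually come from your metrics system rather than raw request lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        return 1.0  # no traffic: treated as fully available (a judgment call)
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served in under threshold_ms."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Hypothetical sample: one 5xx, one slow request.
sample = [
    Request(200, 120.0),
    Request(200, 640.0),
    Request(503, 30.0),
    Request(404, 80.0),  # a 4xx is the client's problem, so it counts as "good"
]
print(availability_sli(sample))    # 0.75
print(latency_sli(sample, 500.0))  # 0.75
```

Whether 4xx responses or failed-but-fast requests count as "good" is a design decision; the sketch picks one reasonable convention, not the only one.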
Target setting
Walk through your past 30 days of data:
Past month: 99.94% availability.
Question: would 99.9% be 'good enough' for users?
Answer: yes, our last incident wasn't because of "uptime."
SLO: 99.9%.
Don’t pick aspirational targets you can’t currently hit. Anchor the SLO to current performance: slightly above your worst real month, not at your best one.
Error budget
If your SLO is 99.9% over 30 days:
Budget = 0.1% of all events.
At 1M requests/day = 30M/month.
Budget = 30,000 failed requests.
Per minute: ~0.7 failures allowed. Per hour: ~42.
The budget is a finite resource. Spend it wisely.
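The budget arithmetic is easy to get wrong by a factor of ten, so it can help to compute it rather than eyeball it. A small sketch, assuming a request-based SLO over a rolling window; the function name and return shape are illustrative only.

```python
def error_budget(slo: float, requests_per_day: int, window_days: int = 30) -> dict:
    """Allowed failures for a request-based SLO over a rolling window."""
    total = requests_per_day * window_days
    allowed = total * (1 - slo)
    hours = window_days * 24
    return {
        "total_requests": total,
        "allowed_failures": allowed,
        "per_hour": allowed / hours,
        "per_minute": allowed / (hours * 60),
    }


budget = error_budget(slo=0.999, requests_per_day=1_000_000)
print(f"allowed failures: {budget['allowed_failures']:,.0f}")  # 30,000
print(f"per hour:   {budget['per_hour']:.1f}")                 # 41.7
print(f"per minute: {budget['per_minute']:.2f}")               # 0.69
```

The per-minute figure is an average. Real failures arrive in bursts, which is why the budget is tracked over the whole window rather than minute by minute.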
Burn rate alerting
Don’t alert on “we’re at 99.85% in the last hour.” Alert on burn rate:
fast burn: consuming 2% of the 30-day budget in 1 hour (burn rate 14.4) → page
slow burn: consuming 10% of the 30-day budget in 3 days (burn rate 1) → ticket
This is the multi-window, multi-burn-rate approach from Google’s SRE Workbook. It catches both sudden outages and slow regressions.
# Prometheus example
- alert: HighErrorBudgetBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) /
     sum(rate(http_requests_total[1h]))) > 0.001 * 14.4
  for: 2m
  labels: { severity: page }
14.4 = the burn rate that spends 2% of a 30-day budget per hour, exhausting it in about 2 days. The 0.001 is the error budget for a 99.9% SLO; change it to match yours.
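To make the burn-rate numbers less magic, the relationships can be sketched in a few lines of Python. The function names are made up for illustration. The 5-minute short window in `should_page` follows the common SRE Workbook pairing and is an assumption here, not something the rule above implements.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - slo)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    return window_days / rate


def should_page(ratio_1h: float, ratio_5m: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The short window (assumed 5m) makes the alert stop firing soon after
    the problem is fixed.
    """
    return (burn_rate(ratio_1h, slo) > threshold
            and burn_rate(ratio_5m, slo) > threshold)


print(round(burn_rate(0.0144, 0.999), 2))          # 14.4
print(round(budget_consumed(14.4, hours=1), 4))    # 0.02 -> 2% of the budget
print(round(days_to_exhaustion(14.4), 2))          # 2.08
print(should_page(0.02, 0.03, slo=0.999))          # True
print(should_page(0.02, 0.0005, slo=0.999))        # False: already recovering
```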
Error budget policy
When the budget is depleted:
Budget remaining: 0%
Action:
- Freeze feature deploys (only bug fixes / reliability work).
- Reliability work prioritized until budget recovers.
- Postmortem on what consumed the budget.
This is the contract. Without it, the SLO is theatre.
When the budget is healthy:
Budget remaining: >50%
Action:
- Ship features confidently.
- Take controlled risk on new launches.
- Use the budget — don't hoard it.
Hoarding budget = under-shipping. The budget exists to be spent.
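A policy is easier to enforce when it is mechanical. Here is a minimal sketch of the two states described above as a function. The middle "caution" band between 0% and 50% is my assumption to fill the gap; the action names are placeholders for whatever your deploy tooling understands.

```python
def budget_remaining(allowed_failures: float, observed_failures: float) -> float:
    """Fraction of the error budget left; zero or below means it is spent."""
    return 1 - observed_failures / allowed_failures


def policy_action(remaining: float) -> str:
    """Map remaining budget to a policy state."""
    if remaining <= 0:
        return "freeze"   # bug fixes and reliability work only, plus a postmortem
    if remaining > 0.5:
        return "ship"     # healthy: take controlled risk
    return "caution"      # assumed middle band: ship, but watch the burn rate


print(policy_action(budget_remaining(30_000, 6_000)))   # ship
print(policy_action(budget_remaining(30_000, 21_000)))  # caution
print(policy_action(budget_remaining(30_000, 30_000)))  # freeze
```

Wiring something like `policy_action` into a deploy gate is one way to keep the freeze from depending on someone remembering to call it.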
SLOs vs alerts
Bad: alert on “service down.”
Good: alert on “burning budget too fast.”
Old-school monitoring: alert on every blip, so people get paged constantly and start tuning the alerts out. SLO-based: alert only when it actually matters to users.
What’s NOT a good SLO
- CPU usage — internal, doesn’t reflect user impact.
- Number of restarts — internal.
- Disk usage — internal.
These are useful operational metrics, but they’re not SLOs. Users don’t care if your CPU is at 80%; they care if their request worked.
Tools
| Tool | Strengths |
|---|---|
| Sloth | Generate Prometheus rules from SLO YAML |
| Pyrra | Similar; CRDs for K8s |
| Datadog SLOs | Built-in if you’re on Datadog |
| Honeycomb / Grafana Cloud | Built-in |
| Custom | Just build it on your existing metrics |
For self-hosted Prometheus shops: Sloth or Pyrra.
Capacity planning via SLO
Run load tests to find the throughput at which your service’s latency SLI starts breaching the SLO.
At 10k RPS: p99 = 200ms (SLO compliant)
At 20k RPS: p99 = 800ms (SLO breach)
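The capacity ceiling sits somewhere between those two points; run intermediate load levels to narrow it down. If the test ran against a single replica, that is your per-replica ceiling. Scale horizontally before you hit it.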
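Picking the ceiling out of load-test results can be sketched as below. The 500ms p99 threshold and the intermediate 15k RPS data point are hypothetical, added only to show how extra test levels narrow the answer.

```python
from typing import Optional


def capacity_ceiling(results: list[tuple[int, float]],
                     slo_p99_ms: float) -> Optional[int]:
    """Highest tested RPS, below the first breach, whose p99 met the threshold.

    `results` is a list of (rps, p99_ms) pairs from load tests.
    Returns None if even the lowest tested load breaches.
    """
    ceiling = None
    for rps, p99_ms in sorted(results):
        if p99_ms > slo_p99_ms:
            break  # stop at the first breach; higher loads are not trusted
        ceiling = rps
    return ceiling


# Hypothetical load-test results: (requests per second, p99 latency in ms).
load_test = [(10_000, 200.0), (15_000, 420.0), (20_000, 800.0)]
print(capacity_ceiling(load_test, slo_p99_ms=500.0))  # 15000
```

The result is only as precise as the spacing of your test levels, so leave headroom below it when setting autoscaling thresholds.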
Per-customer SLOs
For multi-tenant SaaS, the global SLO can hide tenant-specific issues. Track per-tier:
Free tier: 99.9%
Pro tier: 99.95%
Enterprise: 99.99%
Different SLOs justify different infrastructure investment.
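To see how a global number can hide a tier-level breach, here is a small sketch using the tier targets above. The request counts are invented for illustration, and the dictionary layout is just one convenient shape.

```python
TIER_SLOS = {"free": 0.999, "pro": 0.9995, "enterprise": 0.9999}


def tiers_in_breach(counts: dict[str, tuple[int, int]]) -> list[str]:
    """counts maps tier -> (good_requests, total_requests)."""
    breached = []
    for tier, (good, total) in counts.items():
        if total and good / total < TIER_SLOS[tier]:
            breached.append(tier)
    return breached


# Hypothetical 30-day counts per tier.
observed = {
    "free": (9_990_500, 10_000_000),   # 99.905%
    "pro": (1_999_200, 2_000_000),     # 99.96%
    "enterprise": (499_900, 500_000),  # 99.98%
}

good = sum(g for g, _ in observed.values())
total = sum(t for _, t in observed.values())
print(f"global: {good / total:.4%}")  # looks fine against a 99.9% global SLO
print(tiers_in_breach(observed))      # ['enterprise']
```

Here the aggregate comfortably clears 99.9%, yet the enterprise tier misses its own target, which is exactly the case a single global SLO would not surface.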
Common mistakes
1. Pick “5 nines” because it sounds good
99.999% allows about 5 minutes of downtime per year. Each extra nine tends to cost far more than the last, and most users won’t notice the difference. Engineering vanity.
2. SLOs nobody acts on
Pretty dashboards; budget exhausted; nobody freezes. The SLO is performative. Make the policy real.
3. Bad SLI
“All metrics” SLO that mixes latency and availability and CPU. Pick clean SLIs that map to user experience.
4. No burn rate alerts
Alert at “below SLO” → too late. Alert on burn rate → catches early.
5. SLOs without product input
Engineering picks 99.99% because it’s high; product wanted 99% because users don’t care. Talk to product.
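For the conversation with product in mistakes 1 and 5, it helps to translate each candidate target into allowed downtime. A quick sketch, assuming a time-based availability SLO over a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(slo: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of full downtime a time-based availability SLO permits."""
    return (1 - slo) * period_minutes


for slo in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.3%}: {minutes:,.1f} min/year")
```

Each extra nine cuts the allowance by a factor of ten, so the question for product is whether users would notice the difference between the rows.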
What I’d ship today
For a service:
- 2–3 SLIs: availability, latency, maybe quality.
- 30-day SLOs at realistic targets.
- Sloth-generated burn rate alerts.
- Error budget policy documented and enforced.
- Quarterly review of SLOs (still meaningful?).
- Capacity ceilings based on SLI breach point.
Read this next
If you want my SLO definitions + Prometheus rules starter, it’s at rajpoot.dev.