SLOs (Service Level Objectives) and error budgets are the SRE discipline that translates “we want it reliable” into engineering decisions. Done well, they end the eternal debate between “ship features” and “improve reliability.” This post is the working playbook.

SLI, SLO, SLA

  • SLI: Service Level Indicator — a metric. “% of HTTP requests completing successfully under 500ms.”
  • SLO: Service Level Objective — the target. “99.9% over 30 days.”
  • SLA: Service Level Agreement — a contract with consequences. “99.5% or you get a refund.”

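Most engineering decisions live at the SLO level. SLAs are a legal matter, and they are usually set looser than the SLO so there is a margin before penalties kick in.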

Pick the right SLI

For user-facing services:

  • Availability: % of requests completing without 5xx.
  • Latency: % of requests under N ms.
  • Quality: % of correct responses (for things like search, recommendations).
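To make the first two definitions concrete, here is a minimal sketch of computing availability and latency SLIs from raw request records. The `Request` type, the sample data, and the 500 ms threshold are illustrative assumptions, not part of any particular monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests that did not return a 5xx status."""
    if not requests:
        return 100.0  # no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests served in under threshold_ms."""
    if not requests:
        return 100.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


# Hypothetical sample: one slow request, one 503, one client error (404).
sample = [
    Request(200, 120.0),
    Request(200, 640.0),
    Request(503, 30.0),
    Request(404, 80.0),
]
print(availability_sli(sample))    # 3 of 4 requests are non-5xx
print(latency_sli(sample, 500.0))  # 3 of 4 requests are under 500 ms
```

Note that the 404 counts as "available" here: a client error is usually the user's problem, not the service's. Whether that holds for your service is a judgment call.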

For data pipelines:

  • Freshness: % of data delivered within the freshness target.
  • Completeness: % of data points received vs expected.
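The pipeline SLIs follow the same good-events-over-total-events shape. A minimal sketch, assuming you can measure a per-record delivery delay and an expected record count (both hypothetical inputs here):

```python
def freshness_sli(delays_s: list[float], target_s: float) -> float:
    """Percentage of records delivered within target_s seconds."""
    if not delays_s:
        return 100.0
    on_time = sum(1 for d in delays_s if d <= target_s)
    return 100.0 * on_time / len(delays_s)


def completeness_sli(received: int, expected: int) -> float:
    """Percentage of expected data points that actually arrived."""
    if expected == 0:
        return 100.0
    # Cap at expected so duplicates cannot push the SLI above 100%.
    return 100.0 * min(received, expected) / expected


# Hypothetical numbers: delays in seconds, with a 5-minute freshness target.
delays = [30.0, 45.0, 400.0, 20.0]
print(freshness_sli(delays, target_s=300.0))
print(completeness_sli(received=9_900, expected=10_000))
```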

Pick 2–4 SLIs per service. Too many SLIs = nobody pays attention to any.

Target setting

Walk through your past 30 days of data:

Past month: 99.94% availability.
Question: would 99.9% be 'good enough' for users?
Answer: yes, our last incident wasn't because of "uptime."
SLO: 99.9%.

Don’t pick aspirational targets you can’t currently hit. Set the SLO at “current performance, slightly higher than the worst real month.”
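One way to sanity-check a candidate target is to count how many past months would have met it. A rough sketch: only the 99.94% figure comes from the example above, and the other monthly values are made up for illustration.

```python
def months_meeting(target_pct: float, history_pct: list[float]) -> int:
    """Number of past months whose availability met or beat the target."""
    return sum(1 for month in history_pct if month >= target_pct)


# Hypothetical six-month history (only 99.94 is from the post).
history = [99.94, 99.97, 99.91, 99.95, 99.96, 99.93]

for candidate in (99.9, 99.95, 99.99):
    met = months_meeting(candidate, history)
    print(f"{candidate}%: met in {met} of {len(history)} months")
```

A target that history never met is aspirational. A target met every month with lots of room to spare may be too loose to drive any decisions.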

Error budget

If your SLO is 99.9% over 30 days:

Budget = 0.1% of all events.
At 1M requests/day = 30M/month.
Budget = 30,000 failed requests.

Per minute: ~0.7 failures allowed. Per hour: ~42.

The budget is a finite resource. Spend it wisely.
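The arithmetic is easy to get wrong by a factor of ten, so here is a small sketch that derives the budget from the SLO and traffic volume. The function name and return shape are my own, and it assumes traffic is spread evenly across the window.

```python
def error_budget(slo_pct: float, requests_per_day: int,
                 window_days: int = 30) -> dict[str, float]:
    """Allowed failures over the SLO window, plus per-day/hour/minute rates."""
    total = requests_per_day * window_days
    allowed = total * (100.0 - slo_pct) / 100.0
    return {
        "total_requests": float(total),
        "allowed_failures": allowed,
        "per_day": allowed / window_days,
        "per_hour": allowed / (window_days * 24),
        "per_minute": allowed / (window_days * 24 * 60),
    }


budget = error_budget(slo_pct=99.9, requests_per_day=1_000_000)
for name, value in budget.items():
    print(f"{name}: {value:,.2f}")
```

At 1M requests/day and 99.9%, that works out to about 1,000 failures per day, roughly 42 per hour, and fewer than one per minute.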

Burn rate alerting

Don’t alert on “we’re at 99.85% in the last hour.” Alert on burn rate:

fast burn:    2% of monthly budget in 1 hour   (burn rate 14.4)  →  page
medium burn:  5% of monthly budget in 6 hours  (burn rate 6)     →  page
slow burn:    10% of monthly budget in 3 days  (burn rate 1)     →  ticket

This is the multi-window, multi-burn-rate approach from Google’s SRE Workbook. It catches both sudden outages and slow regressions.

# Prometheus example
# 0.001 is the error budget for a 99.9% SLO; 14.4 is the burn-rate multiplier.
- alert: HighErrorBudgetBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) /
     sum(rate(http_requests_total[1h]))) > 0.001 * 14.4
  for: 2m
  labels: { severity: page }

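14.4 is the burn rate that consumes 2% of a 30-day budget in one hour, which would exhaust the whole budget in about 2 days. The 0.001 is the error budget for a 99.9% SLO, so adjust it to match yours.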
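If you want to derive the multiplier yourself rather than copy 14.4, the relationship is just budget fraction times SLO window divided by alert window. A small sketch, with function names of my own choosing:

```python
def burn_rate(budget_fraction: float, alert_window_h: float,
              slo_window_days: int = 30) -> float:
    """Burn rate that spends budget_fraction of the budget in alert_window_h."""
    return budget_fraction * slo_window_days * 24 / alert_window_h


def error_ratio_threshold(slo_pct: float, rate: float) -> float:
    """Error ratio to alert on: the budget (1 - SLO) scaled by the burn rate."""
    return (100.0 - slo_pct) / 100.0 * rate


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """How long the full budget lasts if this burn rate is sustained."""
    return slo_window_days * 24 / rate


fast = burn_rate(budget_fraction=0.02, alert_window_h=1)
print(f"burn rate: {fast:.1f}")
print(f"alert threshold at 99.9%: {error_ratio_threshold(99.9, fast):.4f}")
print(f"budget gone in: {hours_to_exhaustion(fast):.0f} hours")
```

For a 99.9% SLO the threshold comes out to an error ratio of about 0.0144, which is the `0.001 * 14.4` in the rule above.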

Error budget policy

When the budget is depleted:

Budget remaining: 0%
Action:
- Freeze feature deploys (only bug fixes / reliability work).
- Reliability work prioritized until budget recovers.
- Postmortem on what consumed the budget.

This is the contract. Without it, the SLO is theatre.

When the budget is healthy:

Budget remaining: >50%
Action:
- Ship features confidently.
- Take controlled risk on new launches.
- Use the budget — don't hoard it.

Hoarding budget = under-shipping. The budget exists to be spent.
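A policy is easier to enforce when it is a function rather than a wiki page. A minimal sketch of the two rules above: the "caution" band between 0% and 50% is my assumption, since the policy as written only covers the depleted and healthy cases.

```python
def budget_remaining_pct(slo_pct: float, failed: int, total: int) -> float:
    """Share of the error budget still unspent, in percent (can go negative)."""
    allowed = total * (100.0 - slo_pct) / 100.0
    if allowed == 0:
        return 0.0
    return 100.0 * (1.0 - failed / allowed)


def policy_action(remaining_pct: float) -> str:
    """Map remaining budget to the action the policy calls for."""
    if remaining_pct <= 0:
        return "freeze"   # bug fixes and reliability work only
    if remaining_pct > 50:
        return "ship"     # take controlled risk, spend the budget
    return "caution"      # assumption: middle band is not defined in the post


# 30M requests in the window at a 99.9% SLO gives a budget of ~30,000 failures.
healthy = budget_remaining_pct(99.9, failed=12_000, total=30_000_000)
depleted = budget_remaining_pct(99.9, failed=45_000, total=30_000_000)
print(f"{healthy:.0f}% left -> {policy_action(healthy)}")
print(f"{depleted:.0f}% left -> {policy_action(depleted)}")
```

Something like this could gate a deploy pipeline, though wiring it in is a separate decision from writing the policy down.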

SLOs vs alerts

Bad: alert on “service down.”

Good: alert on “burning budget too fast.”

Old-school monitoring: alert on every blip; people get paged constantly and learn to tune the alerts out. SLO-based: alert when it actually matters to users.

What’s NOT a good SLO

  • CPU usage — internal, doesn’t reflect user impact.
  • Number of restarts — internal.
  • Disk usage — internal.

These are useful operational metrics, but they’re not SLOs. Users don’t care if your CPU is at 80%; they care if their request worked.

Tools

Tool                       Strengths
Sloth                      Generates Prometheus rules from SLO YAML
Pyrra                      Similar; CRDs for K8s
Datadog SLOs               Built-in if you’re on Datadog
Honeycomb / Grafana Cloud  Built-in
Custom                     Just build it on your existing metrics

For self-hosted Prometheus shops: Sloth or Pyrra.

Capacity planning via SLO

Run load tests to find the throughput at which your service’s latency SLI starts breaching the SLO.

At 10k RPS: p99 = 200ms (SLO compliant)
At 20k RPS: p99 = 800ms (SLO breach)

That puts your capacity ceiling per replica somewhere between 10k and 20k RPS; narrow it down with more runs in that range. Scale horizontally before you hit it.
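Picking the ceiling out of load test results can be sketched in a few lines. The 500 ms p99 limit and the 15k RPS data point are assumptions for illustration; only the 10k and 20k figures come from the example above.

```python
from typing import Optional


def capacity_ceiling(results: dict[int, float],
                     p99_limit_ms: float) -> Optional[int]:
    """Highest tested RPS that stays within the limit before the first breach."""
    ceiling = None
    for rps in sorted(results):
        if results[rps] > p99_limit_ms:
            break  # everything above this point is assumed to breach too
        ceiling = rps
    return ceiling


# RPS -> measured p99 latency in ms. The 15k run is hypothetical.
load_test = {10_000: 200.0, 15_000: 340.0, 20_000: 800.0}
print(capacity_ceiling(load_test, p99_limit_ms=500.0))
```

The result is only as precise as the spacing of your test runs, so the true ceiling lies between the returned value and the next tested step.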

Per-customer SLOs

For multi-tenant SaaS, the global SLO can hide tenant-specific issues. Track per-tier:

Free tier:    99.9%
Pro tier:     99.95%
Enterprise:   99.99%

Different SLOs justify different infrastructure investment.
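Here is a sketch of how a healthy global number can mask a tier that is out of compliance. The request counts are invented, and the tier names and targets mirror the table above.

```python
TIER_SLO = {"free": 99.9, "pro": 99.95, "enterprise": 99.99}


def tier_report(counts: dict[str, tuple[int, int]]) -> dict[str, bool]:
    """counts maps tier -> (good_requests, total_requests); True means SLO met."""
    report = {}
    for tier, (good, total) in counts.items():
        sli = 100.0 * good / total if total else 100.0
        report[tier] = sli >= TIER_SLO[tier]
    return report


# Hypothetical month of traffic per tier.
counts = {
    "free": (999_500, 1_000_000),      # 99.95%
    "pro": (499_800, 500_000),         # 99.96%
    "enterprise": (99_980, 100_000),   # 99.98%
}

good_all = sum(good for good, _ in counts.values())
total_all = sum(total for _, total in counts.values())
print(f"global: {100.0 * good_all / total_all:.3f}%")
print(tier_report(counts))
```

The global figure sits comfortably above 99.9%, yet the enterprise tier misses its 99.99% target because its traffic is a small share of the total.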

Common mistakes

1. Pick “5 nines” because it sounds good

99.999% = about 5 minutes of downtime per year. Costs on the order of 100× more than 99.9%. Most users won’t notice the difference. Engineering vanity.
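To see how quickly the allowance shrinks with each extra nine, here is a quick sketch that converts an availability target into allowed downtime. It treats availability as time-based and a year as 365 days, both simplifications.

```python
def allowed_downtime_minutes(slo_pct: float, days: float = 365.0) -> float:
    """Minutes of full downtime permitted over the period at this target."""
    return days * 24 * 60 * (100.0 - slo_pct) / 100.0


for slo in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}%: {minutes:.1f} minutes/year")
```

Three nines allows almost nine hours a year. Five nines allows barely enough time for a human to acknowledge a page.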

2. SLOs nobody acts on

Pretty dashboards; budget exhausted; nobody freezes. The SLO is performative. Make the policy real.

3. Bad SLI

“All metrics” SLO that mixes latency and availability and CPU. Pick clean SLIs that map to user experience.

4. No burn rate alerts

Alert at “below SLO” → too late. Alert on burn rate → catches early.

5. SLOs without product input

Engineering picks 99.99% because it’s high; product wanted 99% because users don’t care. Talk to product.

What I’d ship today

For a service:

  • 2–3 SLIs: availability, latency, maybe quality.
  • 30-day SLOs at realistic targets.
  • Sloth-generated burn rate alerts.
  • Error budget policy documented and enforced.
  • Quarterly review of SLOs (still meaningful?).
  • Capacity ceilings based on SLI breach point.

Read this next

If you want my SLO definitions + Prometheus rules starter, it’s at rajpoot.dev.

