The Google SRE book is excellent and intimidating. This post is the working summary every backend developer should have. Three terms, one idea, one lever. Once you internalize it, the argument between “ship features” and “fix reliability” stops being a matter of faith.
SLIs, SLOs, SLAs — what each one is
SLI — Service Level Indicator. A measurement of one aspect of service quality. Usually a ratio: good_events / total_events.
Example: of all HTTP requests in the last hour, what fraction returned 2xx within 500 ms?
SLO — Service Level Objective. A target for an SLI over a window.
Example: 99.9% of requests in any 30-day window return 2xx within 500 ms.
SLA — Service Level Agreement. A contract. The SLO surfaced to a customer with consequences. Usually a number lower than the SLO (“we promise 99.5%, we target 99.9%”).
You build internal SLOs. Sales sells SLAs. Don’t confuse them.
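To make the three definitions concrete, here is a minimal Python sketch. The `Request` type, the `sli` helper, and the sample traffic are all hypothetical names and numbers made up for illustration; only the 2xx-within-500-ms rule and the 99.9% / 99.5% figures come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to respond, in milliseconds


def sli(requests: list[Request], max_latency_ms: float = 500.0) -> float:
    """Fraction of requests that returned 2xx within the latency threshold."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing went wrong
    good = sum(
        1 for r in requests
        if 200 <= r.status < 300 and r.latency_ms <= max_latency_ms
    )
    return good / len(requests)


SLO_TARGET = 0.999   # internal objective
SLA_PROMISE = 0.995  # looser number promised to customers

# Made-up sample: 997 fast successes, 2 server errors, 1 slow success.
sample = [Request(200, 120.0)] * 997 + [Request(500, 80.0)] * 2 + [Request(200, 900.0)]
measured = sli(sample)
print(f"SLI={measured:.4f} meets SLO: {measured >= SLO_TARGET}, meets SLA: {measured >= SLA_PROMISE}")
```

With this sample the SLI is 0.997: the internal SLO is missed while the SLA still holds, which is exactly the gap the looser SLA number is meant to provide.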
What an error budget is
If your SLO is 99.9%, your error budget is 0.1%. That’s the share of bad events you’re allowed.
For a service handling 10M requests/month at a 99.9% SLO:
budget = 10,000,000 × 0.001 = 10,000 bad requests
Ten thousand bad requests over the month is fine. It’s expected. It’s why you set 99.9% and not 100%.
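The same arithmetic as a small Python sketch. The function names are illustrative, and the 2,500 bad requests in the usage lines are an invented figure, not a measurement.

```python
def error_budget(total_events: int, slo: float) -> float:
    """Number of bad events the SLO allows over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return total_events * (1.0 - slo)


def budget_remaining(total_events: int, bad_events: int, slo: float) -> float:
    """Fraction of the budget still unspent (negative once the SLO is blown)."""
    budget = error_budget(total_events, slo)
    if budget == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1.0 - bad_events / budget


print(round(error_budget(10_000_000, 0.999)))                 # 10000
print(round(budget_remaining(10_000_000, 2_500, 0.999), 3))   # 0.75
```

A negative return value from `budget_remaining` means the window has already overspent its budget.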
The budget is a currency. You spend it on:
- Risky deploys
- Experiments with new features
- Capacity reductions
- Incidents
You earn it back by:
- Not deploying badly
- Adding reliability work
- Time passing
When the budget is healthy, ship features aggressively. When it’s exhausted, freeze deploys until you’ve earned it back. This is the entire point. It replaces opinions with arithmetic.
Picking the right SLI
The SRE book has an exhaustive taxonomy. In practice, the vast majority of services need just two SLIs:
1. Availability — “did the user get an answer at all?”
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
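Anything 5xx counts as bad; anything else counts as good. A 4xx is not bad: that’s the user’s error, not yours. Requests that never get a response (timeouts, dropped connections) are bad too, but a server-side counter like this one cannot see them, so measure those at the load balancer or the client if they matter.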
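The classification rule can be sketched in Python as follows. Modelling a missing response as `None` is an assumption made for this example; in a real setup that signal would have to come from somewhere that can observe requests the server never answered.

```python
from typing import Optional


def is_good(status: Optional[int]) -> bool:
    """Availability classification; None models a request that got no response."""
    if status is None:
        return False  # timeout or dropped connection
    return not (500 <= status <= 599)


def availability(statuses: list[Optional[int]]) -> float:
    """Share of requests classified as good."""
    if not statuses:
        return 1.0  # assumption: no traffic counts as fully available
    return sum(is_good(s) for s in statuses) / len(statuses)


# Made-up traffic: three 2xx, two 4xx, two 5xx, one request with no response.
observed = [200, 201, 404, 429, 500, 503, None, 200]
print(availability(observed))  # 0.625
```

The two 4xx responses count as good here, matching the rule that client errors do not spend your budget.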
2. Latency — “did the user get the answer fast enough?”
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
The fraction of requests that completed within 500 ms. Anything slower is bad.
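The query works because Prometheus histogram buckets are cumulative: the `le="0.5"` bucket already counts every request that finished in 0.5 s or less. A rough Python equivalent, using invented bucket counts and a hypothetical `latency_sli` helper, looks like this:

```python
def latency_sli(cumulative_buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests at or under threshold_s, from cumulative bucket counts.

    cumulative_buckets maps an upper bound in seconds (float('inf') for +Inf)
    to the number of requests that finished at or under that bound.
    """
    if threshold_s not in cumulative_buckets:
        raise ValueError(f"no bucket boundary at {threshold_s}s; the SLI needs one")
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return cumulative_buckets[threshold_s] / total


# Made-up counts for illustration.
buckets = {0.1: 6_200, 0.25: 8_900, 0.5: 9_620, 1.0: 9_950, float("inf"): 10_000}
print(latency_sli(buckets, 0.5))  # 0.962
```

The `ValueError` reflects a practical constraint: the threshold you pick for the SLO has to coincide with a bucket boundary your histogram actually exports.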
Pick one threshold instead of juggling p50/p95/p99 targets; with a single good/bad cut-off the SLI stays a plain ratio and the budget arithmetic stays simple. The threshold is “the latency above which the user notices.” Most consumer-facing APIs land between 300 ms and 1 s.
For a payment service, the line may be under 200 ms. For a video render service, it may be under 60 s. Pick what matches the user’s actual perception.
Setting a target you’ll defend
Pick a number you would defend in an incident review. Not the highest you can imagine, not 99.999% because it sounds impressive.
Reasonable defaults:
| Service criticality | Availability | Latency target |
|---|---|---|
| Internal experiment | 99% | 1 s |
| Standard backend | 99.5% | 500 ms |
| Customer-facing API | 99.9% | 300 ms |
| Payment / auth | 99.95% | 200 ms |
| Single point of failure | 99.99% | 100 ms |
Don’t go higher than 99.99% lightly. As a rule of thumb, each extra nine costs roughly 10x the effort. Many teams set 99.99% on their auth service and quietly miss it every month, which means the SLO isn’t doing anything.
Honesty rule: if you’ve never hit your SLO target in the last 90 days, your target is wrong, not your service.
Calculating the budget
The budget over a window of N events at SLO s:
budget = N × (1 - s)
In time terms (a 30-day month = 43,200 minutes):
| SLO | Allowed downtime / 30 days |
|---|---|
| 99% | 432 min (7.2 h) |
| 99.5% | 216 min (3.6 h) |
| 99.9% | 43.2 min |
| 99.95% | 21.6 min |
| 99.99% | 4.32 min |
These are the numbers that should set your panic threshold during incidents. If your SLO is 99.9% and a full outage has been running for 35 minutes, you have about 8 minutes of budget left for the rest of the window (43.2 − 35 = 8.2). Act accordingly.
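The table and the incident arithmetic can be reproduced with a few lines of Python. This is a sketch that assumes a 30-day window and treats every incident minute as a total outage; partial outages spend the budget more slowly.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def allowed_downtime_minutes(slo: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of total outage the SLO tolerates over the window."""
    return window_minutes * (1.0 - slo)


def minutes_left(slo: float, outage_minutes_so_far: float) -> float:
    """Budget left, assuming every minute so far was a full outage."""
    return allowed_downtime_minutes(slo) - outage_minutes_so_far


for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}: {allowed_downtime_minutes(slo):.2f} min")

print(f"{minutes_left(0.999, 35):.1f} min left")  # 8.2 min left
```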
Burn-rate alerts (the smart way)
Don’t alert on raw error rate. Alert on how fast you’re consuming the budget.
If you’d burn the entire month’s budget in 1 hour at the current rate, that’s catastrophic — page someone immediately. If you’d burn it in 24 hours, that’s bad but not page-worthy at 3am.
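Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the window. A small Python sketch, assuming a 99.9% SLO and a 30-day window (the helper names are illustrative):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_rate / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


SLO = 0.999
for error_rate in (0.001, 0.006, 0.0144, 0.72):
    rate = burn_rate(error_rate, SLO)
    print(f"error rate {error_rate:.2%}: burn {rate:.1f}x, "
          f"budget gone in {hours_to_exhaustion(rate):.0f} h")
```

At this SLO, a 72% error rate is the "gone in 1 hour" case, a burn rate of 14.4 empties the budget in about 50 hours, and a burn rate of 6 empties it in 120 hours (5 days).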
Multi-window, multi-burn-rate alerts (Google’s recipe):
# Fast burn: 5m AND 1h burn rate > 14.4
# (at this rate the 30-day budget is gone in about 50 hours; 2% of it burns per hour)
# 0.001 is the allowed error rate (1 - SLO) for a 99.9% SLO.
- alert: HighErrorBudgetBurn_Fast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
# Slow burn: 30m AND 6h burn rate > 6
# (at this rate the 30-day budget is gone in 5 days)
- alert: HighErrorBudgetBurn_Slow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m]))
    ) > (6 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
The two-window check cuts false positives: the long window means a brief spike alone won’t page, and the short window lets the alert clear quickly once the problem stops. Tune the multipliers based on your tolerance.
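The decision logic behind each rule is small enough to sketch in Python. The `should_alert` helper and the error rates passed to it are hypothetical; the 14.4 multiplier and the 99.9% SLO are the ones used in the rules above.

```python
def should_alert(short_error_rate: float, long_error_rate: float,
                 burn_threshold: float, slo: float = 0.999) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold."""
    limit = burn_threshold * (1.0 - slo)
    return short_error_rate > limit and long_error_rate > limit


# Brief spike: the 5m window is hot, but the 1h average is still low.
print(should_alert(0.05, 0.004, burn_threshold=14.4))   # False
# Sustained outage: both windows are above 14.4 * 0.001 = 0.0144.
print(should_alert(0.05, 0.02, burn_threshold=14.4))    # True
# Recovered: the 1h window is still elevated, but the last 5m are clean.
print(should_alert(0.0002, 0.02, burn_threshold=14.4))  # False
```

The third case is the reason for the short window: without it, the alert would keep firing for up to an hour after the errors had stopped.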
In 2026 most observability stacks (Sloth, Pyrra, Nobl9, Datadog SLOs, Grafana SLO) generate these alerts from a short declarative SLO definition. Use them; don’t hand-roll.
What you do with the budget
The SLO turns reliability into a decision rule, not a vibe.
When the budget is full
- Ship aggressively.
- Try risky deploys; do canary rollouts at higher traffic %.
- Run game days and chaos experiments.
- Enable feature flags faster.
- Reduce review weight on changes.
When the budget is half-spent
- Move at normal pace.
- Tighten canary thresholds.
- Defer non-critical migrations.
When the budget is empty
- Freeze non-essential deploys. Bug fixes for stability only.
- Postpone migrations.
- Allocate engineering time to the reliability backlog until the budget recovers.
- Give postmortems more attention.
This is the contract. Engineers love it because it’s clear. PMs eventually love it because it’s predictable. Leadership loves it because it ends the “are we shipping fast or being reliable?” debate.
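One way to turn the three states into a rule a deploy pipeline could check is sketched below in Python. The exact 0.5 cut-off and the `Mode` names are assumptions made for illustration; the policy above only names "full", "half-spent", and "empty".

```python
from enum import Enum


class Mode(Enum):
    SHIP = "ship aggressively"
    NORMAL = "normal pace, tighter canaries"
    FREEZE = "freeze non-essential deploys"


def deploy_mode(budget_remaining: float) -> Mode:
    """Map the unspent budget fraction (1.0 = untouched) to a policy mode.

    The 0.5 boundary is an illustrative assumption, not a standard value.
    """
    if budget_remaining <= 0.0:
        return Mode.FREEZE
    if budget_remaining <= 0.5:
        return Mode.NORMAL
    return Mode.SHIP


for remaining in (0.9, 0.4, 0.0, -0.2):
    print(f"{remaining:+.0%} of budget left: {deploy_mode(remaining).value}")
```

Writing the thresholds down as code, even code this small, is what keeps the freeze from being renegotiated during every incident.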
Composite SLOs
Most user journeys touch many services. The user doesn’t care about your service mesh — they care that “I clicked checkout and it worked.”
Compute a journey SLO:
journey_availability = Π(service_availability for each service in path)
If checkout calls cart, payments, and shipping, and each has 99.9%:
journey = 0.999³ = 99.7%
That’s your real SLO from the user’s seat, assuming the services fail independently and every call in the path is required. It pushes you to either improve the weakest service or reduce dependencies. Use journey SLOs for product-level commitments; service SLOs for engineering team alignment.
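The product is a one-liner in Python. This sketch assumes a strictly serial path with independent failures; retries, fallbacks, and correlated outages all change the result.

```python
from math import prod


def journey_availability(service_availabilities: list[float]) -> float:
    """Availability of a serial path, assuming independent failures."""
    return prod(service_availabilities)


checkout = journey_availability([0.999, 0.999, 0.999])
print(f"{checkout:.4%}")  # 99.7003%

# Ten services at 99.9% each leave roughly 99% for the journey.
print(f"{journey_availability([0.999] * 10):.2%}")  # 99.00%
```

The second line shows why long call chains are expensive: ten "three nines" dependencies leave the journey with about two.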
Things that ruin SLOs
Including planned maintenance in “downtime”
Either commit to “no downtime ever” or carve maintenance windows out of the SLO computation explicitly. Half-hearted commitments breed cynicism.
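If you go the carve-out route, the exclusion should be explicit in the computation rather than applied by hand afterwards. A minimal Python sketch, with hypothetical names and made-up timestamps, where each event is a `(timestamp, was_good)` pair:

```python
from datetime import datetime, timedelta

Window = tuple[datetime, datetime]  # [start, end) of planned maintenance


def in_maintenance(ts: datetime, windows: list[Window]) -> bool:
    """True if ts falls inside any declared maintenance window."""
    return any(start <= ts < end for start, end in windows)


def availability_excluding_maintenance(
    events: list[tuple[datetime, bool]], windows: list[Window]
) -> float:
    """Availability over events outside maintenance; events inside are ignored."""
    counted = [good for ts, good in events if not in_maintenance(ts, windows)]
    if not counted:
        return 1.0  # assumption: nothing counted means nothing failed
    return sum(counted) / len(counted)


start = datetime(2026, 1, 10, 2, 0)
windows = [(start, start + timedelta(minutes=30))]
events = [
    (datetime(2026, 1, 10, 1, 59), True),
    (datetime(2026, 1, 10, 2, 5), False),   # inside the window: excluded
    (datetime(2026, 1, 10, 2, 20), False),  # inside the window: excluded
    (datetime(2026, 1, 10, 2, 30), True),   # window end is exclusive
    (datetime(2026, 1, 10, 3, 0), False),   # real failure: counted
    (datetime(2026, 1, 10, 3, 1), True),
]
print(availability_excluding_maintenance(events, windows))  # 0.75
```

Failures outside the declared window still count, so a maintenance job that overruns its slot spends budget like any other incident.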
SLOs nobody acts on
If the SLO burns and nobody changes plans, you don’t have an SLO. You have a dashboard.
Over-counting bad events
A 503 from your service is bad. A 503 caused by a downstream dependency is also bad, because the user still blames you, even though it’s “not your fault.” Track both.
SLIs that don’t reflect user pain
A 99.99% availability SLI on /healthz is a lie. SLI traffic must look like real user traffic. Filter your metric to your user-facing endpoints.
Setting it once and forgetting
Re-evaluate SLOs quarterly. Your service’s load profile changes; users’ tolerance changes; product priorities shift. SLOs are a living document.
A starter recipe
For a brand-new service:
- Pick two SLIs: availability (non-5xx) and latency (under threshold).
- Pick conservative targets: 99.5% availability, 95% under 500 ms. You can tighten later.
- Pick a 30-day window.
- Compute the budget. Pin the number in the team Slack channel.
- Set burn-rate alerts at 14.4× (fast) and 6× (slow).
- Set up a dashboard with: current SLI, 30-day burn, time-to-budget-exhaustion at current rate.
- Tell the team: when the budget hits 0, deploys freeze.
That’s enough. Refine as you learn.
What to read next
- The Site Reliability Workbook, Chapter 2, “Implementing SLOs”: the book chapter on this topic.
- The Sloth or Pyrra docs — both generate Prometheus rules from a one-line SLO definition.
- Observability — Logs, Metrics, Traces — the data plane SLOs need.
- GitOps with Argo CD — automate the “freeze deploys” rule.
If you want a working SLO setup (Prometheus rules, Grafana dashboards, alerts) for a typical FastAPI/Go service, it’s on rajpoot.dev.