SLOs and Error Budgets for App Developers — SRE Without the Mystique

The Google SRE book is excellent and intimidating. This post is the working summary every backend developer should have read by now. Three terms, one idea, one lever. Once you internalize it, the argument between “ship features” and “fix reliability” stops being a religious one.

SLIs, SLOs, SLAs — what each one is

SLI — Service Level Indicator. A measurement of one aspect of service quality. Usually a ratio: good_events / total_events.

Example: of all HTTP requests in the last hour, what fraction returned 2xx within 500 ms?

SLO — Service Level Objective. A target for an SLI over a window.

Example: 99.9% of requests in any 30-day window return 2xx within 500 ms.

SLA — Service Level Agreement. A contract. The SLO surfaced to a customer with consequences. Usually a number lower than the SLO (“we promise 99.5%, we target 99.9%”).

You build internal SLOs. Sales sells SLAs. Don’t confuse them.

What an error budget is

If your SLO is 99.9%, your error budget is 0.1%. That’s the share of bad events you’re allowed.

For a service handling 10M requests/month at a 99.9% SLO:

budget = 10,000,000 × 0.001 = 10,000 bad requests

Ten thousand bad requests over the month is fine. It’s expected. It’s why you set 99.9% and not 100%.
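A minimal sketch of that arithmetic in Python. The function names are illustrative, not taken from any library, and the 2,500 bad requests in the usage line are a made-up figure.

```python
def error_budget(total_events: int, slo: float) -> float:
    """Number of bad events the SLO allows over the window."""
    return total_events * (1.0 - slo)


def budget_remaining(total_events: int, slo: float, bad_events: int) -> float:
    """Fraction of the budget still unspent (negative once it is overspent)."""
    budget = error_budget(total_events, slo)
    return (budget - bad_events) / budget


print(round(error_budget(10_000_000, 0.999)))                # 10000
print(round(budget_remaining(10_000_000, 0.999, 2_500), 3))  # 0.75
```

The second function is the one worth putting on a dashboard: it says how much of the month's allowance is left, not how many errors happened.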

The budget is a currency. You spend it on:

Risky deploys
Experiments with new features
Capacity reductions
Incidents

You earn it back by:

Not deploying badly
Adding reliability work
Time passing

When the budget is healthy, ship features aggressively. When it’s exhausted, freeze deploys until you’ve earned it back. This is the entire point. It replaces opinions with arithmetic.

Picking the right SLI

The SRE book has an exhaustive taxonomy. In practice, most request-serving services need just two SLIs:

1. Availability — “did the user get an answer at all?”

sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

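Anything 5xx or no-response counts as bad. Anything else counts as good. 4xx is not bad: that’s the user’s fault, not yours. Note that a server-side counter like http_requests_total only sees requests the service actually handled; to count no-response failures, measure at the load balancer or the client.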

2. Latency — “did the user get the answer fast enough?”

sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

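The fraction of requests that completed within 500 ms. Anything slower is bad. The query only works if the histogram has a bucket boundary at exactly 0.5 s, since le="0.5" selects an existing bucket rather than interpolating.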

Pick one threshold instead of tracking p50/p95/p99 separately; SLO arithmetic across several percentiles gets confusing fast. The threshold is “the latency above which the user notices.” Most consumer-facing APIs land between 300 ms and 1 s.

For a payment service: maybe <200ms is the line. For a video render service: maybe <60s. Pick what matches the user’s actual perception.
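To make the good/total ratio concrete, here is a small Python sketch that computes both SLIs from a list of request records. The Request type and the sample data are hypothetical; in production these numbers would come from your metrics system, not from a list in memory.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    duration_s: float  # time to serve the request, in seconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_s: float = 0.5) -> float:
    """Fraction of requests served within the latency threshold."""
    fast = sum(1 for r in requests if r.duration_s <= threshold_s)
    return fast / len(requests)


sample = [
    Request(200, 0.12),
    Request(404, 0.03),  # client error: counts as good for availability
    Request(503, 0.01),  # server error: bad for availability
    Request(200, 0.80),  # succeeded, but slower than the threshold
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Like the histogram query, this latency SLI counts a fast 503 as good. If that bothers you, one option is to compute latency over successful requests only.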

Setting a target you’ll defend

Pick a number you would defend in an incident review. Not the highest you can imagine, not 99.999% because it sounds impressive.

Reasonable defaults:

Service criticality	Availability	Latency target
Internal experiment	99%	1 s
Standard backend	99.5%	500 ms
Customer-facing API	99.9%	300 ms
Payment / auth	99.95%	200 ms
Single point of failure	99.99%	100 ms

Don’t go higher than 99.99% lightly. As a rule of thumb, each extra nine costs roughly 10x the effort. Many teams set 99.99% on their auth service and quietly miss it every month, which means the SLO isn’t doing anything.

Honesty rule: if you’ve never hit your SLO target in the last 90 days, your target is wrong, not your service.

Calculating the budget

The budget over a window of N events at SLO s:

budget = N × (1 - s)

In time terms (a 30-day month = 43,200 minutes):

SLO	Allowed downtime / 30 days
99%	432 min (7.2 h)
99.5%	216 min (3.6 h)
99.9%	43.2 min
99.95%	21.6 min
99.99%	4.32 min

These are the numbers that should set your panic threshold during incidents. If your SLO is 99.9% and a full outage has been running for 35 minutes, you have about 8 minutes of budget left for the rest of the window, and that assumes nothing else has been spent. Act accordingly.
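The table and the incident arithmetic can be reproduced with a few lines of Python. This is a sketch under the assumption that downtime means a full outage; a partial outage spends the budget proportionally more slowly. The function names are illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def allowed_downtime_min(slo: float, window_min: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return window_min * (1.0 - slo)


def minutes_left(slo: float, outage_so_far_min: float) -> float:
    """Budget remaining after a full outage of the given length."""
    return allowed_downtime_min(slo) - outage_so_far_min


for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}  {allowed_downtime_min(slo):7.2f} min")

print(f"{minutes_left(0.999, 35):.1f} min left")  # 8.2 min left
```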

Burn-rate alerts (the smart way)

Don’t alert on raw error rate. Alert on how fast you’re consuming the budget.

If you’d burn the entire month’s budget in 1 hour at the current rate, that’s catastrophic — page someone immediately. If you’d burn it in 24 hours, that’s bad but not page-worthy at 3am.
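Burn rate is the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the window; higher values spend it proportionally faster. A short Python sketch, with illustrative function names and a 30-day window assumed:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_rate / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate holds."""
    return window_days * 24 / rate


# A 1.44% error rate against a 99.9% SLO burns at 14.4x.
print(round(burn_rate(0.0144, 0.999), 1))   # 14.4
print(round(hours_to_exhaustion(14.4), 1))  # 50.0
print(round(hours_to_exhaustion(6), 1))     # 120.0
print(round(hours_to_exhaustion(720), 1))   # 1.0
```

The last line is the catastrophic case from the paragraph above: burning a 30-day budget in one hour means a burn rate of 720.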

Multi-window, multi-burn-rate alerts (Google’s recipe):

# Alert if both the 5m and the 1h burn rate exceed 14.4
# (2% of the 30-day budget per hour; the whole budget in about 50 hours)
# 0.001 is the error budget for a 99.9% SLO, i.e. 1 - 0.999
- alert: HighErrorBudgetBurn_Fast
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical

# Alert if both the 30m and the 6h burn rate exceed 6
# (you'd burn the 30-day budget in 5 days)
- alert: HighErrorBudgetBurn_Slow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m]))
    ) > (6 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)
  for: 15m
  labels:
    severity: warning

The two-window check kills false positives — a 5-minute spike alone won’t page; only a sustained problem will. Tune the multipliers based on your tolerance.
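The logic of the two-window condition is easier to see outside PromQL. The Python sketch below is not how Prometheus evaluates rules; it only mirrors the boolean condition, with made-up error rates for three situations.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def should_alert(short_err: float, long_err: float, slo: float, factor: float) -> bool:
    """Fire only when both the short and the long window exceed the factor."""
    return burn_rate(short_err, slo) > factor and burn_rate(long_err, slo) > factor


SLO = 0.999
# Brief spike: 5% errors over 5m, but the 1h average is still low.
print(should_alert(0.05, 0.004, SLO, 14.4))  # False
# Sustained problem: both windows are well above 14.4x.
print(should_alert(0.05, 0.03, SLO, 14.4))   # True
# Recovered: the 1h window is still hot, the 5m window is clean.
print(should_alert(0.0, 0.03, SLO, 14.4))    # False
```

The third case is the other half of the benefit: once the problem is fixed, the short window drops quickly and the alert stops firing without waiting for the long window to drain.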

In 2026 most SLO tooling (Sloth, Pyrra, Nobl9, Datadog SLOs, Grafana SLO) generates these alerts from a short declarative SLO definition. Use it; don’t hand-roll.

What you do with the budget

The SLO turns reliability into a decision rule, not a vibe.

When the budget is full

Ship aggressively.
Try risky deploys; do canary rollouts at higher traffic %.
Run game days and chaos experiments.
Enable feature flags faster.
Reduce review weight on changes.

When the budget is half-spent

Move at normal pace.
Tighten canary thresholds.
Defer non-critical migrations.

When the budget is empty

Freeze non-essential deploys. Bug fixes for stability only.
Postpone migrations.
Engineering team allocates time to reliability backlog until budget recovers.
Postmortems get more attention.

This is the contract. Engineers love it because it’s clear. PMs eventually love it because it’s predictable. Leadership loves it because it ends the “are we shipping fast or being reliable?” debate.
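A decision rule is only a rule if it can be written down. One way to sketch the three tiers above in Python follows; the 0.5 cut-off and the Policy names are assumptions for illustration, and a team should agree on its own thresholds.

```python
from enum import Enum


class Policy(Enum):
    SHIP_AGGRESSIVELY = "ship aggressively"
    NORMAL_PACE = "normal pace, tighter canaries"
    FREEZE = "freeze non-essential deploys"


def deploy_policy(budget_remaining: float) -> Policy:
    """Map the unspent fraction of the error budget (1.0 = untouched) to a policy.

    The 0.5 cut-off is an illustrative assumption, not a standard.
    """
    if budget_remaining <= 0.0:
        return Policy.FREEZE
    if budget_remaining <= 0.5:
        return Policy.NORMAL_PACE
    return Policy.SHIP_AGGRESSIVELY


print(deploy_policy(0.9).value)   # ship aggressively
print(deploy_policy(0.4).value)   # normal pace, tighter canaries
print(deploy_policy(-0.1).value)  # freeze non-essential deploys
```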

Composite SLOs

Most user journeys touch many services. The user doesn’t care about your service mesh — they care that “I clicked checkout and it worked.”

Compute a journey SLO:

journey_availability = Π(service_availability for each service in path)

If checkout calls cart, payments, and shipping, and each has 99.9%:

journey = 0.999³ ≈ 0.997 = 99.7%

That’s the best availability the journey can offer from the user’s seat, assuming the three services are called in series and fail independently. It pushes you to either improve the weakest service or reduce dependencies. Use journey SLOs for product-level commitments and service SLOs for engineering team alignment.
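The product is a one-liner in Python. This sketch assumes every service in the path must succeed and that failures are independent; correlated failures or retries change the number.

```python
from math import prod


def journey_availability(service_availabilities: list[float]) -> float:
    """Availability of a path where every service must succeed.

    Assumes the services are called in series and fail independently.
    """
    return prod(service_availabilities)


# Checkout calls cart, payments, and shipping, each at 99.9%.
checkout = journey_availability([0.999, 0.999, 0.999])
print(f"{checkout:.4%}")  # 99.7003%
```

Adding a fourth 99.9% dependency drops the journey to about 99.6%, which is why trimming the call path is often cheaper than chasing another nine on one service.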

Things that ruin SLOs

Including planned maintenance in “downtime”

Either commit to “no downtime ever” or carve maintenance windows out of the SLO computation explicitly. Half-hearted commitments breed cynicism.

SLOs nobody acts on

If the SLO burns and nobody changes plans, you don’t have an SLO. You have a dashboard.

Under-counting bad events

A 503 from your service is bad. A 503 caused by a downstream dependency is also bad, because the user blames you either way, even if it’s “not your fault.” Track both.

SLIs that don’t reflect user pain

A 99.99% availability SLI on /healthz is a lie. SLI traffic must look like real user traffic. Filter your metric to your user-facing endpoints.

Setting it once and forgetting

Re-evaluate SLOs quarterly. Your service’s load profile changes; users’ tolerance changes; product priorities shift. SLOs are a living document.

A starter recipe

For a brand-new service:

Pick two SLIs: availability (non-5xx) and latency (under threshold).
Pick conservative targets: 99.5% availability, 95% under 500 ms. You can tighten later.
Pick a 30-day window.
Compute the budget. Post the number in the team Slack channel.
Set burn-rate alerts at 14.4× (fast) and 6× (slow).
Set up a dashboard with: current SLI, 30-day burn, time-to-budget-exhaustion at current rate.
Tell the team: when the budget hits 0, deploys freeze.

That’s enough. Refine as you learn.
