On-call done wrong burns people out and they leave. Done right, it’s a manageable shift that occasionally pages and is supported by good tooling. This post is the practical playbook.
Severity
SEV1: customer-impacting, paying customers down, urgent. Page everyone.
SEV2: significant degradation. Page on-call. Fix in hours.
SEV3: noticeable but bounded. Open ticket. Fix this sprint.
SEV4: cosmetic / minor. Backlog.
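Post these definitions where everyone can see them and review them quarterly. Severity is a written standard, not a relative judgment made in the middle of an incident.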
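One way to keep the definitions written rather than relative is to encode them once and have tooling read from that single source. The sketch below is a minimal illustration: the class and field names are hypothetical, and the SEV4 fix target is an assumption, since the list only says "Backlog".

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Response:
    description: str
    action: str      # who gets notified, or where the work is tracked
    fix_target: str  # how soon a fix is expected


# Values copied from the severity list; the field names are illustrative.
POLICY = {
    Severity.SEV1: Response("customer-impacting, paying customers down", "page everyone", "urgent"),
    Severity.SEV2: Response("significant degradation", "page on-call", "hours"),
    Severity.SEV3: Response("noticeable but bounded", "open ticket", "this sprint"),
    Severity.SEV4: Response("cosmetic / minor", "backlog", "unscheduled"),  # target is assumed
}


def should_page(severity: Severity) -> bool:
    """Only SEV1 and SEV2 interrupt a human; SEV3 and SEV4 become tickets."""
    return severity in (Severity.SEV1, Severity.SEV2)
```

An alert router or incident bot could then call `should_page(Severity.SEV3)` instead of each team deciding on its own what a SEV3 deserves.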
Page volume
A healthy rotation:
- <5 pages per shift (week of on-call).
- <2 page-induced wakeups per shift.
- Most pages have runbooks.
If pages exceed this, fix the system, not the rotation. Common causes:
- Alerts that fire on noise.
- Repeated incidents with no follow-through.
- SLOs misaligned with reality.
For SLOs, see SLOs and Error Budgets for App Developers.
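The healthy-rotation thresholds above are simple enough to check mechanically at the end of each shift. The following is a rough sketch, not tied to any paging vendor: the `Page` record and its fields are hypothetical, and reading "most pages have runbooks" as a strict majority is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert: str
    woke_someone: bool  # True if the page arrived during the responder's sleep hours
    has_runbook: bool


def rotation_health(pages: list[Page]) -> list[str]:
    """Return the thresholds a one-week shift violated (empty list means healthy)."""
    problems = []
    if len(pages) >= 5:
        problems.append(f"{len(pages)} pages this shift (target: fewer than 5)")
    wakeups = sum(p.woke_someone for p in pages)
    if wakeups >= 2:
        problems.append(f"{wakeups} wakeups this shift (target: fewer than 2)")
    with_runbook = sum(p.has_runbook for p in pages)
    # "Most" is interpreted as a strict majority; that cutoff is an assumption.
    if pages and with_runbook * 2 <= len(pages):
        problems.append(f"only {with_runbook}/{len(pages)} pages had a runbook")
    return problems
```

A non-empty result is a signal to fix alerts or follow up on repeat incidents, not to reshuffle the rotation.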
Runbook template
## [Service]: [Symptom]
### What it means
The alert fires when [specific signal]. This indicates [root condition].
### Severity
SEV2 — degrades but doesn't fully break.
### First response
1. Check [dashboard URL].
2. Run `kubectl get pods -n service` to confirm.
3. If [condition], proceed to mitigation A.
4. If [other condition], proceed to mitigation B.
### Mitigation A: restart pods
```
kubectl rollout restart deployment/api -n service
```
Wait 60s; check dashboard.
### Mitigation B: failover
[Specific steps]
### Don't
- Don't [thing that makes it worse].
- Don't [other anti-pattern].
### Escalate to
Team contact: ...
### Root cause history
- 2026-04-15: caused by [...]; fixed in [PR]
Keep each runbook to one page, and review it after every incident that touches it.
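If runbooks live as markdown files, a small linter can keep them aligned with the template. This sketch assumes the `###` section names shown in the template above; the 60-line cap is an assumed stand-in for "one page", and the function name is hypothetical.

```python
import re

REQUIRED_SECTIONS = [
    "What it means",
    "Severity",
    "First response",
    "Don't",
    "Escalate to",
    "Root cause history",
]
MAX_LINES = 60  # assumed stand-in for "one page"; tune to taste


def lint_runbook(markdown: str) -> list[str]:
    """Return a list of problems found in one runbook's markdown text."""
    problems = []
    # Collect the text of every level-3 heading, e.g. "### Severity" -> "Severity".
    headings = {
        m.group(1).strip()
        for m in re.finditer(r"^###\s+(.+)$", markdown, re.MULTILINE)
    }
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    # The template names mitigations "Mitigation A: ...", "Mitigation B: ...".
    if not any(h.startswith("Mitigation") for h in headings):
        problems.append("no mitigation section")
    if len(markdown.splitlines()) > MAX_LINES:
        problems.append(f"longer than {MAX_LINES} lines")
    return problems
```

Running this in CI over the runbook directory is one way to notice when a runbook drifts from the template between incident reviews.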
Escalation
Page → Primary on-call (15 min)
↓ no ack
Secondary on-call (15 min)
↓ no ack
Manager
↓ no ack
Engineering leadership
Pages must be ack’d within the window; otherwise they escalate. No page sits with a single person indefinitely.
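The chain above amounts to a list of targets with ack windows. Here is a minimal sketch of that logic, independent of any paging tool. The 15-minute windows for primary and secondary come from the chain; the manager window is not stated there, so the 15 minutes used below is an assumption, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    target: str
    ack_window_min: int  # minutes to wait for an ack before escalating


# Primary and secondary windows come from the chain above.
# The manager window is not stated there; 15 minutes is an assumption.
CHAIN = [
    Step("primary on-call", 15),
    Step("secondary on-call", 15),
    Step("manager", 15),
    Step("engineering leadership", 0),  # last stop, nothing further to escalate to
]


def current_target(minutes_unacked: int) -> str:
    """Who should be holding the page after this many minutes without an ack."""
    elapsed = 0
    for step in CHAIN[:-1]:
        elapsed += step.ack_window_min
        if minutes_unacked < elapsed:
            return step.target
    return CHAIN[-1].target
```

Under these assumed windows, a page that nobody acknowledges reaches engineering leadership after 45 minutes.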
Follow-the-sun
For 24/7 needs, rotate across timezones:
- AMER team: 9am–9pm ET.
- EMEA: 9am–9pm CET.
- APAC: 9am–9pm IST/SGT.
Each team owns “their” daytime hours, so pages outside working hours are rare.
Pre-req: at least one engineer per region. For most early-stage companies, single-region with rotation is fine.
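Before committing to a follow-the-sun schedule, it helps to confirm the regional windows really add up to 24 hours in UTC. The sketch below does that with fixed UTC offsets, which ignore daylight saving time (an assumption); the function names are hypothetical.

```python
# Fixed UTC offsets in minutes; these ignore daylight saving time (an assumption).
OFFSETS_MIN = {"ET": -5 * 60, "CET": 1 * 60, "IST": 5 * 60 + 30, "SGT": 8 * 60}
DAY = 24 * 60


def utc_minutes(start_hour: int, end_hour: int, tz: str) -> set[int]:
    """Minutes of the UTC day covered by a local start_hour..end_hour shift."""
    start = (start_hour * 60 - OFFSETS_MIN[tz]) % DAY
    length = (end_hour - start_hour) * 60
    return {(start + m) % DAY for m in range(length)}


def _hhmm(minute: int) -> str:
    return f"{(minute % DAY) // 60:02d}:{minute % 60:02d}"


def uncovered(shifts: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Return the UTC gaps (as HH:MM start/end pairs) that no shift covers."""
    covered = set().union(*(utc_minutes(*s) for s in shifts))
    missing = sorted(set(range(DAY)) - covered)
    gaps, run_start, prev = [], None, None
    for m in missing:
        if run_start is None:
            run_start = m
        elif m != prev + 1:
            # The previous run of uncovered minutes ended; record it.
            gaps.append((run_start, prev + 1))
            run_start = m
        prev = m
    if run_start is not None:
        gaps.append((run_start, prev + 1))
    return [(_hhmm(a), _hhmm(b)) for a, b in gaps]


print(uncovered([(9, 21, "ET"), (9, 21, "CET"), (9, 21, "SGT")]))  # []
print(uncovered([(9, 21, "ET"), (9, 21, "CET"), (9, 21, "IST")]))  # [('02:00', '03:30')]
```

With these fixed offsets, anchoring APAC on SGT covers the full day, while anchoring it on IST leaves a 90-minute gap from 02:00 to 03:30 UTC. Daylight saving shifts the edges, so it is worth rerunning the check with the offsets that apply in each season.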
Tooling
- PagerDuty / Opsgenie / Grafana OnCall for paging.
- incident.io / FireHydrant / Rootly for incident management automation.
- Statuspage for customer comms.
- Slack for war room.
Don’t build these. Buy them.
Compensation
On-call costs people. Compensate it:
- Stipend per shift (e.g., $100–500/week).
- Time-off in lieu for high-page weeks.
- Day off after a bad night.
Without compensation, on-call breeds resentment.
What I’d ship today
For a team adopting on-call:
- Severity definitions documented.
- Rotation calendar in PagerDuty.
- Runbook for every alert that pages.
- Page volume dashboard — alert when over threshold.
- Postmortems for SEV1/2 (see Incident Response and Postmortems).
- Stipend per on-call week.
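The "runbook for every alert that pages" item is easy to enforce with a check over the alert inventory. This is a sketch under assumptions: the inventory is represented as a list of dicts with hypothetical keys `name`, `pages`, and `runbook_url`, which is not the export format of any particular tool.

```python
def paging_alerts_without_runbook(alerts: list[dict]) -> list[str]:
    """Names of alerts that page a human but have no runbook link."""
    return [
        a["name"]
        for a in alerts
        if a.get("pages") and not a.get("runbook_url")
    ]


# Hypothetical inventory for illustration only.
inventory = [
    {"name": "api-5xx-rate", "pages": True, "runbook_url": "https://wiki.example/runbooks/api-5xx"},
    {"name": "queue-depth", "pages": True, "runbook_url": ""},
    {"name": "disk-80-percent", "pages": False},
]
print(paging_alerts_without_runbook(inventory))  # ['queue-depth']
```

Failing a CI job when this list is non-empty keeps new paging alerts from shipping without a runbook.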
Read this next
- Incident Response and Postmortems
- SLOs and Error Budgets for App Developers
- OpenTelemetry End-to-End in 2026
- Observability 2.0
If you want my on-call rotation and runbook templates, they’re at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.