On-call done wrong burns people out and they leave. Done right, it’s a manageable shift that occasionally pages and is supported by good tooling. This post is the practical playbook.

Severity

SEV1: customer-impacting, paying customers down, urgent. Page everyone.
SEV2: significant degradation. Page on-call. Fix in hours.
SEV3: noticeable but bounded. Open ticket. Fix this sprint.
SEV4: cosmetic / minor. Backlog.

Post these definitions everywhere and review them quarterly. Severity is decided by the written definitions, not by relative judgment in the moment.
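One way to keep the definitions written rather than relative is to encode them where the paging logic can read them. A minimal sketch, assuming the four levels above; the names `Severity`, `Response`, `POLICY`, and `should_page` are hypothetical, and the action strings paraphrase the list:

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Response:
    action: str  # what happens when an incident of this severity is declared
    pages: bool  # whether a human gets paged


# Paraphrases the written definitions; this table is illustrative, not a standard.
POLICY = {
    Severity.SEV1: Response("page everyone", pages=True),
    Severity.SEV2: Response("page on-call, fix in hours", pages=True),
    Severity.SEV3: Response("open ticket, fix this sprint", pages=False),
    Severity.SEV4: Response("add to backlog", pages=False),
}


def should_page(sev: Severity) -> bool:
    """Return True if this severity is meant to interrupt a human."""
    return POLICY[sev].pages
```

With this in place, `should_page(Severity.SEV3)` is `False`, so a SEV3 becomes a ticket rather than a page.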

Page volume

A healthy rotation:

  • <5 pages per shift (week of on-call).
  • <2 page-induced wakeups per shift.
  • Most pages have runbooks.

If pages exceed this, fix the system, not the rotation. Common causes:

  • Alerts that fire on noise.
  • Repeated incidents with no follow-through.
  • SLOs misaligned with reality.
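The thresholds above are easy to check per shift. A rough sketch, assuming you can export each shift's pages as records; the `Page` fields, `shift_health`, and `repeat_offenders` are hypothetical names, and "most pages have runbooks" is read here as a strict majority:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Thresholds from the list above: fewer than 5 pages and fewer than 2 wakeups per weekly shift.
MAX_PAGES_PER_SHIFT = 5
MAX_WAKEUPS_PER_SHIFT = 2


@dataclass
class Page:
    alert: str          # alert name as it appears in the paging tool
    woke_someone: bool  # True if the page landed during the responder's sleep hours
    has_runbook: bool   # True if the alert links to a runbook


def shift_health(pages: List[Page]) -> Dict[str, bool]:
    """Compare one shift's pages against the healthy-rotation thresholds."""
    total = len(pages)
    wakeups = sum(p.woke_someone for p in pages)
    with_runbook = sum(p.has_runbook for p in pages)
    return {
        "page_volume_ok": total < MAX_PAGES_PER_SHIFT,
        "wakeups_ok": wakeups < MAX_WAKEUPS_PER_SHIFT,
        # Assumption: "most" means more than half of the pages.
        "runbooks_ok": total == 0 or with_runbook * 2 > total,
    }


def repeat_offenders(pages: List[Page], min_count: int = 2) -> List[Tuple[str, int]]:
    """List alerts that fired at least min_count times, most frequent first."""
    counts = Counter(p.alert for p in pages)
    return [(alert, n) for alert, n in counts.most_common() if n >= min_count]
```

`repeat_offenders` is the part that points at the system rather than the rotation: alerts that keep firing are the ones to tune or fix.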


Runbook template

## [Service]: [Symptom]

### What it means
The alert fires when [specific signal]. This indicates [root condition].

### Severity
SEV2 — degrades but doesn't fully break.

### First response
1. Check [dashboard URL].
2. Run `kubectl get pods -n service` to confirm.
3. If [condition], proceed to mitigation A.
4. If [other condition], proceed to mitigation B.

### Mitigation A: restart pods
```
kubectl rollout restart deployment/api -n service
```
Wait 60s; check dashboard.

### Mitigation B: failover
[Specific steps]

### Don't
- Don't [thing that makes it worse].
- Don't [other anti-pattern].

### Escalate to
Team contact: ...

### Root cause history
- 2026-04-15: caused by [...]; fixed in [PR]

Keep each runbook to one page, and review it after every incident that touches it.
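A template only helps if runbooks follow it. A small lint sketch, assuming runbooks are stored as markdown using the `###` headings from the template above; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names:

```python
import re
from typing import List

# Section headings taken from the template; mitigations are checked separately
# because their titles vary ("Mitigation A: restart pods", and so on).
REQUIRED_SECTIONS = [
    "What it means",
    "Severity",
    "First response",
    "Don't",
    "Escalate to",
    "Root cause history",
]


def missing_sections(runbook_markdown: str) -> List[str]:
    """Return the template sections that a runbook is missing."""
    headings = {
        m.group(1).strip()
        for m in re.finditer(r"^###\s+(.+)$", runbook_markdown, re.MULTILINE)
    }
    missing = [s for s in REQUIRED_SECTIONS if s not in headings]
    if not any(h.startswith("Mitigation") for h in headings):
        missing.append("Mitigation (at least one)")
    return missing
```

Run in CI over the runbook directory, a non-empty result can fail the build before an incomplete runbook reaches on-call.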

Escalation

Page → Primary on-call (15 min)
         ↓ no ack
       Secondary on-call (15 min)
         ↓ no ack
       Manager
         ↓ no ack
       Engineering leadership

Pages must be ack’d within the window; otherwise they escalate to the next level. No page sits on a single name forever.
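The chain above is a lookup on how long a page has gone unacknowledged. A minimal sketch; the 15-minute windows for primary and secondary come from the diagram, while the 15-minute manager window is an assumption (the diagram gives none), and `Step`, `CHAIN`, and `current_target` are hypothetical names:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    who: str
    ack_timeout_min: Optional[int]  # None marks the last step: nowhere further to escalate


CHAIN = [
    Step("Primary on-call", 15),
    Step("Secondary on-call", 15),
    Step("Manager", 15),  # assumption: the source does not state this window
    Step("Engineering leadership", None),
]


def current_target(minutes_unacked: float, chain: List[Step] = CHAIN) -> str:
    """Return who should be holding the page after this many minutes without an ack."""
    elapsed = 0
    for step in chain:
        if step.ack_timeout_min is None:
            return step.who
        elapsed += step.ack_timeout_min
        if minutes_unacked < elapsed:
            return step.who
    return chain[-1].who
```

In practice the paging tool owns this logic; the sketch is mainly useful for checking that a configured policy matches the written one.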

Follow-the-sun

For 24/7 needs, rotate across timezones:

  • AMER team: 9am–9pm ET.
  • EMEA: 9am–9pm CET.
  • APAC: 9am–9pm IST/SGT.

Each team owns “their” hours; pages outside a team's shift are rare.

Pre-req: at least one engineer per region. For most early-stage companies, single-region with rotation is fine.
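Before relying on a follow-the-sun schedule, it is worth checking that the windows actually cover 24 hours in UTC. A sketch assuming fixed offsets with DST ignored: ET as UTC-5, CET as UTC+1, and SGT as UTC+8 for APAC. The function names are hypothetical:

```python
from typing import Dict, List, Set, Tuple

MINUTES_PER_DAY = 24 * 60
SHIFT_START_LOCAL = 9 * 60   # 9am local, in minutes after midnight
SHIFT_END_LOCAL = 21 * 60    # 9pm local

# UTC offsets in minutes. Assumptions: standard time only, APAC on SGT.
REGIONS = {
    "AMER": -5 * 60,  # ET as EST
    "EMEA": 1 * 60,   # CET
    "APAC": 8 * 60,   # SGT
}


def covered_utc_minutes(offset_min: int) -> Set[int]:
    """UTC minutes of the day covered by a 9am-9pm local shift at this offset."""
    start_utc = SHIFT_START_LOCAL - offset_min
    length = SHIFT_END_LOCAL - SHIFT_START_LOCAL
    return {(start_utc + m) % MINUTES_PER_DAY for m in range(length)}


def fmt(minute: int) -> str:
    return f"{minute // 60:02d}:{minute % 60:02d}"


def coverage_gaps(offsets: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return uncovered UTC ranges as (start, end) strings; empty means full coverage.

    A gap that wraps past midnight is reported as two ranges.
    """
    covered: Set[int] = set()
    for offset in offsets.values():
        covered |= covered_utc_minutes(offset)
    gaps: List[List[int]] = []
    for minute in sorted(set(range(MINUTES_PER_DAY)) - covered):
        if gaps and gaps[-1][1] == minute:
            gaps[-1][1] = minute + 1  # extend the current gap
        else:
            gaps.append([minute, minute + 1])
    return [(fmt(start), fmt(end)) for start, end in gaps]
```

Under these assumptions, `coverage_gaps(REGIONS)` returns an empty list. Swapping APAC to IST (UTC+5:30, offset `330`) reports one gap from 02:00 to 03:30 UTC, so which APAC timezone you staff matters.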

Tooling

  • PagerDuty / Opsgenie / Grafana OnCall for paging.
  • incident.io / FireHydrant / Rootly for incident management automation.
  • Statuspage for customer comms.
  • Slack for war room.

Don’t build these. Buy them.

Compensation

On-call costs people. Compensate it:

  • Stipend per shift (e.g., $100–500/week).
  • Time-off in lieu for high-page weeks.
  • Day off after a bad night.

Without compensation, on-call breeds resentment.

What I’d ship today

For a team adopting on-call:

  1. Severity definitions documented.
  2. Rotation calendar in PagerDuty.
  3. Runbook for every alert that pages.
  4. Page volume dashboard — alert when over threshold.
  5. Postmortems for SEV1/2 (see Incident Response).
  6. Stipend per on-call week.

Read this next

If you want my on-call rotation and runbook templates, they’re at rajpoot.dev.

