How big should an on-call rotation be?

Minimum 4–6 people for a sustainable 24/7 rotation; 8+ for follow-the-sun across regions. With fewer, each person is on call too often and burns out fast. With fewer than 4, question whether you need 24/7 coverage at all.

How do I make on-call less painful?

Pages should be actionable, rare, and have runbooks. If a page fires and there's nothing to do, fix the alert. If the runbook is missing, write it after the incident.

On-Call and Runbooks That Save Your Friday Night in 2026

On-call done wrong burns people out and they leave. Done right, it’s a manageable shift that occasionally pages and is supported by good tooling. This post is the practical playbook.

Severity

SEV1: customer-impacting, paying customers down, urgent. Page everyone.
SEV2: significant degradation. Page on-call. Fix in hours.
SEV3: noticeable but bounded. Open ticket. Fix this sprint.
SEV4: cosmetic / minor. Backlog.

Posted everywhere. Reviewed quarterly. Written down, not judged relative to whatever else is on fire.
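
One way to keep the definitions "written" is to encode them next to the alerting config. Here is a minimal sketch, assuming a Python codebase; the names `Severity`, `Response`, `POLICY`, and `should_page` are hypothetical, and the action strings just restate the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Response:
    action: str  # what happens when this severity is declared
    pages: bool  # whether a human gets paged


# Mirrors the written definitions; the wording of each action is illustrative.
POLICY = {
    Severity.SEV1: Response("page everyone", pages=True),
    Severity.SEV2: Response("page on-call, fix in hours", pages=True),
    Severity.SEV3: Response("open ticket, fix this sprint", pages=False),
    Severity.SEV4: Response("add to backlog", pages=False),
}


def should_page(severity: Severity) -> bool:
    """Return True if this severity should page someone."""
    return POLICY[severity].pages
```

With a table like this, an alert rule has to name a severity, and the paging decision follows from the table instead of from whoever wrote the alert.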

Page volume

A healthy rotation:

<5 pages per shift (week of on-call).
<2 page-induced wakeups per shift.
Most pages have runbooks.

If pages exceed this, fix the system, not the rotation. Common causes:

Alerts that fire on noise.
Repeated incidents with no follow-through.
SLOs misaligned with reality.
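
The thresholds above are easy to check mechanically at the end of each shift. This is a rough sketch, not tied to any paging tool's API: `ShiftStats` and `shift_problems` are hypothetical names, and reading "most pages have runbooks" as "more than half" is my assumption.

```python
from dataclasses import dataclass

MAX_PAGES_PER_SHIFT = 5    # healthy rotations stay strictly below this
MAX_WAKEUPS_PER_SHIFT = 2  # page-induced wakeups, strictly below this


@dataclass
class ShiftStats:
    pages: int               # total pages during the week-long shift
    wakeups: int             # pages that woke someone up
    pages_with_runbook: int  # pages whose alert linked to a runbook


def shift_problems(stats: ShiftStats) -> list[str]:
    """Return the threshold violations for one week-long shift."""
    problems = []
    if stats.pages >= MAX_PAGES_PER_SHIFT:
        problems.append(f"{stats.pages} pages (want < {MAX_PAGES_PER_SHIFT})")
    if stats.wakeups >= MAX_WAKEUPS_PER_SHIFT:
        problems.append(f"{stats.wakeups} wakeups (want < {MAX_WAKEUPS_PER_SHIFT})")
    # Assumption: "most" means more than half of the pages.
    if stats.pages > 0 and stats.pages_with_runbook * 2 <= stats.pages:
        problems.append("most pages lack a runbook")
    return problems
```

An empty list means the shift was within bounds; anything else is a to-do item for the system, not for the person who was on call.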


Runbook template

## [Service]: [Symptom]

### What it means
The alert fires when [specific signal]. This indicates [root condition].

### Severity
SEV2 — degrades but doesn't fully break.

### First response
1. Check [dashboard URL].
2. Run `kubectl get pods -n service` to confirm.
3. If [condition], proceed to mitigation A.
4. If [other condition], proceed to mitigation B.

### Mitigation A: restart pods
\`\`\`
kubectl rollout restart deployment/api -n service
\`\`\`
Wait 60s; check dashboard.

### Mitigation B: failover
[Specific steps]

### Don't
- Don't [thing that makes it worse].
- Don't [other anti-pattern].

### Escalate to
Team contact: ...

### Root cause history
- 2026-04-15: caused by [...]; fixed in [PR]

Each runbook one page. Reviewed after every incident touching it.
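
If runbooks live in a repo as markdown, a small check can flag ones that drifted from the template. A minimal sketch, assuming each runbook uses the `###` headings shown above; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and requiring at least one "Mitigation" heading is my reading of the template.

```python
import re

# Section headings taken from the template above.
REQUIRED_SECTIONS = [
    "What it means",
    "Severity",
    "First response",
    "Don't",
    "Escalate to",
    "Root cause history",
]


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return template sections that have no '### <name>' heading."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^###\s+(.+)$", runbook_markdown, flags=re.MULTILINE)
    }
    missing = [name for name in REQUIRED_SECTIONS if name not in headings]
    # Mitigation headings carry a suffix ("Mitigation A: ..."), so match by prefix.
    if not any(heading.startswith("Mitigation") for heading in headings):
        missing.append("Mitigation")
    return missing
```

Running this in CI over the runbook directory is one way to catch a runbook with no escalation contact before the incident does.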

Escalation

Page → Primary on-call (15 min)
         ↓ no ack
       Secondary on-call (15 min)
         ↓ no ack
       Manager
         ↓ no ack
       Engineering leadership

Pages must be ack’d in time; otherwise they escalate. No page stays stuck on a single name forever.
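
The chain above is just a list of acknowledgement windows. Here is a minimal sketch of the lookup, not how any paging product implements it: `CHAIN` and `current_target` are hypothetical names, the two 15-minute windows come from the diagram, and the manager's window is an assumed value because the diagram does not give one.

```python
# (who holds the page, minutes they have to acknowledge before it moves on).
# None means the last tier: there is nobody left to escalate to.
CHAIN = [
    ("primary on-call", 15),
    ("secondary on-call", 15),
    ("manager", 15),  # assumption: window not specified in the chain above
    ("engineering leadership", None),
]


def current_target(minutes_unacked: float) -> str:
    """Return who should hold a page that has gone unacknowledged this long."""
    elapsed = 0.0
    for who, window in CHAIN:
        if window is None or minutes_unacked < elapsed + window:
            return who
        elapsed += window
    return CHAIN[-1][0]
```

For example, `current_target(20)` returns `"secondary on-call"`: the primary's 15 minutes have passed and the secondary's have not.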

Follow-the-sun

For 24/7 needs, rotate across timezones:

AMER team: 9am–9pm ET.
EMEA: 9am–9pm CET.
APAC: 9am–9pm IST/SGT.

Each team owns “their” hours, so pages outside working hours are rare.

Pre-req: at least one engineer per region. For most early-stage companies, single-region with rotation is fine.
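
It is worth checking that the regional windows actually add up to 24 hours. The sketch below converts each 9am–9pm shift to UTC and counts uncovered minutes. The function names are hypothetical, and the fixed UTC offsets (ET = UTC-5, CET = UTC+1, SGT = UTC+8, IST = UTC+5:30) are an assumption that ignores daylight saving time.

```python
MINUTES_PER_DAY = 24 * 60


def shift_minutes_utc(start_hour: int, end_hour: int, utc_offset_minutes: int) -> set[int]:
    """Minutes of the UTC day covered by a local start-to-end shift."""
    start = (start_hour * 60 - utc_offset_minutes) % MINUTES_PER_DAY
    length = ((end_hour - start_hour) % 24) * 60
    return {(start + minute) % MINUTES_PER_DAY for minute in range(length)}


def uncovered_minutes(shifts: list[tuple[int, int, int]]) -> int:
    """Count minutes of the UTC day that no shift covers."""
    covered: set[int] = set()
    for start_hour, end_hour, offset in shifts:
        covered |= shift_minutes_utc(start_hour, end_hour, offset)
    return MINUTES_PER_DAY - len(covered)


# (local start hour, local end hour, UTC offset in minutes); standard time assumed.
AMER = (9, 21, -5 * 60)
EMEA = (9, 21, 1 * 60)
APAC_SGT = (9, 21, 8 * 60)
APAC_IST = (9, 21, 5 * 60 + 30)

print(uncovered_minutes([AMER, EMEA, APAC_SGT]))  # 0
print(uncovered_minutes([AMER, EMEA, APAC_IST]))  # 90
```

Under these assumptions, an APAC team on SGT closes the loop, while one on IST leaves a 90-minute gap (02:00–03:30 UTC) that one of the teams has to absorb. Daylight saving shifts the AMER and EMEA windows by an hour, so rerun the numbers for summer offsets.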

Tooling

PagerDuty / Opsgenie / Grafana OnCall for paging.
incident.io / FireHydrant / Rootly for incident management automation.
Statuspage for customer comms.
Slack for war room.

Don’t build these. Buy them.

Compensation

On-call costs people. Compensate it:

Stipend per shift (e.g., $100–500/week).
Time-off in lieu for high-page weeks.
Day off after a bad night.

Without compensation, on-call breeds resentment.

What I’d ship today

For a team adopting on-call:

Severity definitions documented.
Rotation calendar in PagerDuty.
Runbook for every alert that pages.
Page volume dashboard — alert when over threshold.
Postmortems for SEV1/2 (see Incident Response).
Stipend per on-call week.

Read this next

If you want my on-call rotation + runbook templates, they’re at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work. See my projects and ways to reach me at rajpoot.dev.
