Incidents will happen. The question is whether your team learns from them or repeats them every quarter. This post is the working playbook for 2026 incident response.
Severity levels
Define them and stick to them:
- SEV1: urgent, direct impact on paying customers. Page everyone. Open a war room.
- SEV2: significant degradation but not down. Page on-call. Fix in hours.
- SEV3: noticeable but bounded. Open ticket. Fix this sprint.
- SEV4: cosmetic / minor. Backlog.
Post the definitions somewhere visible and make them part of new-hire onboarding.
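One way to keep the definitions from drifting is to encode them next to the paging configuration. This is a minimal sketch in Python, assuming the four levels listed here; `ResponsePolicy`, its fields, and the `pages_someone` helper are hypothetical names for illustration, not part of any paging tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class ResponsePolicy:
    who_to_page: str  # "everyone", "on-call", or "nobody"
    action: str       # first step once the severity is declared
    fix_target: str   # rough expectation for the fix


# Mirrors the severity list; the exact field wording is illustrative.
POLICIES = {
    Severity.SEV1: ResponsePolicy("everyone", "open a war room", "now"),
    Severity.SEV2: ResponsePolicy("on-call", "page on-call", "hours"),
    Severity.SEV3: ResponsePolicy("nobody", "open a ticket", "this sprint"),
    Severity.SEV4: ResponsePolicy("nobody", "add to backlog", "unscheduled"),
}


def policy_for(severity: Severity) -> ResponsePolicy:
    """Look up the agreed response for a declared severity."""
    return POLICIES[severity]


def pages_someone(severity: Severity) -> bool:
    """True when declaring this severity should wake a human up."""
    return POLICIES[severity].who_to_page != "nobody"
```

Keeping the table in one place means the wiki page and the automation can be generated from the same source.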
The Incident Commander
One human is the IC for the duration. Their job: coordinate, not fix.
- Tracks who’s working on what.
- Drives comms (status page, customer notifications).
- Decides escalations.
- Calls “we’re done.”
Without an IC, an incident becomes 5 people doing duplicate work and 10 people watching.
Comms cadence
For SEV1/2:
- Status page within 5 minutes.
- Internal Slack updates every 15–30 min (“still investigating”, “rolled back”, “monitoring”).
- Customer comms within 30 min.
- Closing comms when resolved.
Templates pre-written. Nobody composes prose at 3am.
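The cadence above can be turned into concrete deadlines that a bot or the IC checks against. This is a rough sketch, assuming the clock starts when the incident is declared (the post does not say which moment counts) and using the tight end of the 15–30 minute range for internal updates; all names are hypothetical.

```python
from datetime import datetime, timedelta

STATUS_PAGE_DEADLINE = timedelta(minutes=5)
CUSTOMER_COMMS_DEADLINE = timedelta(minutes=30)
INTERNAL_UPDATE_INTERVAL = timedelta(minutes=15)  # assumption: tight end of 15-30 min


def comms_deadlines(declared_at: datetime) -> dict[str, datetime]:
    """Return the first deadline for each SEV1/2 communication channel."""
    return {
        "status_page": declared_at + STATUS_PAGE_DEADLINE,
        "first_internal_update": declared_at + INTERNAL_UPDATE_INTERVAL,
        "customer_comms": declared_at + CUSTOMER_COMMS_DEADLINE,
    }


def overdue(declared_at: datetime, now: datetime, sent: set[str]) -> list[str]:
    """List channels whose deadline has passed and that have not been sent."""
    deadlines = comms_deadlines(declared_at)
    return sorted(
        name for name, due in deadlines.items()
        if name not in sent and now > due
    )
```

For example, ten minutes after declaration with nothing sent, `overdue` reports only `status_page`; the other two channels are still within their windows.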
Runbooks
For known failure modes, runbooks exist. Each is one page:
## Postgres replica lag spike
Symptoms: replica lag > 60s; reads inconsistent.
Diagnosis:
1. Check replication slot: SELECT * FROM pg_replication_slots;
2. Check primary load: ...
3. Check network: ...
Mitigation:
- If WAL is filling: drop dead slots.
- If primary is overloaded: failover.
- If network: ...
Don't:
- Don't just restart replicas — they'll fall further behind.
Runbooks are reviewed and updated after every incident that touched them.
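Runbook steps that are pure data filtering are easy to script. The sketch below takes rows shaped like the output of the `pg_replication_slots` query in the runbook and lists slots with no connected consumer. Treating "dead" as `active = false` is an assumption made here; an inactive slot may belong to a replica that is only briefly disconnected, so the result is a list of candidates for a human to review, not a drop list. The sample rows are made up.

```python
from typing import Iterable, Mapping


def inactive_slots(rows: Iterable[Mapping[str, object]]) -> list[str]:
    """Return names of replication slots whose `active` column is false."""
    return sorted(str(row["slot_name"]) for row in rows if not row["active"])


# Hypothetical rows, using the real slot_name and active columns.
sample = [
    {"slot_name": "replica_a", "active": True},
    {"slot_name": "old_replica_b", "active": False},
]

print(inactive_slots(sample))  # prints ['old_replica_b']
```

Dropping a slot is then a separate, deliberate step, for example with PostgreSQL's `pg_drop_replication_slot` function.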
The 5-step incident loop
1. Detect: an alert fires, a customer reports it, or monitoring catches it.
2. Triage: pick an IC and declare the severity.
3. Diagnose: what’s wrong?
4. Mitigate: restore service. Don’t fix the root cause yet.
5. Resolve: verify that everything is back to normal.
The order matters. Mitigate before fix. A rollback that restores service in 5 minutes is better than a “real fix” that takes 2 hours. Customers wait while you debug.
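If incidents are tracked in a tool, the ordering can be enforced rather than remembered. This is a small sketch of the loop as a state machine, assuming that skipping ahead is forbidden but stepping back (say, from Mitigate to Diagnose when a rollback does not help) is allowed; that second rule is an assumption, and the class names are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECT = 1
    TRIAGE = 2
    DIAGNOSE = 3
    MITIGATE = 4
    RESOLVE = 5


class Incident:
    """Tracks which phase of the loop an incident is in."""

    def __init__(self) -> None:
        self.phase = Phase.DETECT

    def advance(self, to: Phase) -> None:
        # Moving forward is only allowed one phase at a time, so nobody can
        # declare "resolved" without having mitigated first.
        if int(to) > int(self.phase) + 1:
            raise ValueError(
                f"cannot skip from {self.phase.name} to {to.name}"
            )
        self.phase = to
```

A call such as `advance(Phase.RESOLVE)` straight after triage raises `ValueError`, which is the point: resolution has to pass through mitigation.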
Postmortem template
Within 5 days of resolution:
## Incident: [Title]
Date: ...
Duration: ...
Severity: ...
Impact: [Specific customers / metrics affected]
## Timeline
- 14:32 alert fires
- 14:35 IC assigned
- 14:40 root cause hypothesized
- 14:55 mitigation applied
- 15:10 verified normal
## Root cause
What actually happened. Technical, specific, blameless.
## Why it took N minutes
Detection: ...
Diagnosis: ...
Mitigation: ...
## What went well
...
## What didn't
...
## Action items
- [Type] [Description] — [Owner] — [Due]
- ...
Action items have owners and due dates, tracked like any other engineering work.
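The "Why it took N minutes" section is simple arithmetic on the timeline. Here is a sketch using the example timestamps from the template; the phase names are my own labels, and it assumes all timestamps fall on the same day. True detection time (fault start to alert) cannot be computed from this timeline because the fault's start is not recorded in it.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the example timeline in the template.
timeline = {
    "alert": "14:32",
    "ic_assigned": "14:35",
    "hypothesis": "14:40",
    "mitigated": "14:55",
    "verified": "15:10",
}


def breakdown(t: dict[str, str]) -> dict[str, int]:
    """Split the incident duration into per-phase minutes."""
    return {
        "triage": minutes_between(t["alert"], t["ic_assigned"]),
        "diagnosis": minutes_between(t["ic_assigned"], t["hypothesis"]),
        "mitigation": minutes_between(t["hypothesis"], t["mitigated"]),
        "verification": minutes_between(t["mitigated"], t["verified"]),
        "total": minutes_between(t["alert"], t["verified"]),
    }


print(breakdown(timeline))
# prints {'triage': 3, 'diagnosis': 5, 'mitigation': 15, 'verification': 15, 'total': 38}
```

In this example, mitigation and verification account for 30 of the 38 minutes, which is where the action items should probably focus.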
Blameless
The rule: when discussing the incident, no individual is named for “the cause.” We talk about systems, processes, missing safeguards. People made decisions with the info they had; usually a better system would have prevented the bad decision.
Why: blame culture means people hide mistakes. Hidden mistakes don’t get fixed. Blameless culture surfaces failures so they get fixed.
It does NOT mean “no one is responsible.” Responsibility is for getting the action items done, not for “causing” the incident.
Action item rigor
The single most important practice: action items get done.
- Each action item has an owner and a due date.
- Status reviewed weekly.
- Track completion rate as a team metric.
- If action items aren’t getting done, surface to leadership.
Postmortems where 80% of action items are stale are theatre.
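Completion rate and staleness are cheap to compute once action items live in a structured form. This is a minimal sketch, assuming "stale" means open and past its due date (the post does not define it); the `ActionItem` fields are hypothetical and would map to whatever the ticket tracker exposes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items marked done (1.0 for an empty list)."""
    if not items:
        return 1.0  # assumption: nothing outstanding counts as fully complete
    return sum(item.done for item in items) / len(items)


def stale_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]
```

Reviewing the output of `stale_items` in the weekly status check gives the escalation to leadership a concrete list instead of a feeling.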
Cultural cues
Things that signal a healthy incident culture:
- “What was the fastest way to mitigate?” gets asked, not “who pushed the button.”
- Incident commanders rotate widely, not just among one or two heroes.
- Postmortems are reviewed by people not on the incident.
- Action items finish.
- Repeat incidents are rare; lessons stick.
Tools
- PagerDuty / Opsgenie / Grafana OnCall for paging.
- Status pages: Atlassian Statuspage (statuspage.io) or self-hosted.
- Incident Slack bot — incident.io, FireHydrant, Rootly automate IC bookkeeping.
- Runbook docs — wherever the team writes (Notion, Slab, internal wiki).
- Postmortem repo — every incident archived, searchable.
For the SLO and error-budget side, see SLOs and Error Budgets.
Read this next
- SLOs and Error Budgets for App Developers
- OpenTelemetry End-to-End in 2026
- Platform Engineering and IDPs
- Circuit Breakers, Bulkheads, and Backpressure
If you want my incident-response runbook and postmortem templates, they’re at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.