Do small teams need formal incident response?

Yes — even a 5-person team. The lightweight version: a Slack channel, a designated IC for each incident, a 1-page postmortem template. Heavier processes scale up later.

What's the difference between blameless and 'blame-free'?

Blameless means we don't assign individual fault — we look at the system that allowed the error. 'Blame-free' often means avoiding hard truths. Blameless still surfaces what went wrong; it just doesn't punish the person who pushed the button.

Incident Response and Blameless Postmortems in 2026

Incidents will happen. The question is whether your team learns from them or repeats them every quarter. This post is the working playbook for 2026 incident response.

Severity levels

Define them and stick to them:

SEV1: outage or severe impact for paying customers. Page everyone. War room.
SEV2: significant degradation but not down. Page on-call. Fix in hours.
SEV3: noticeable but bounded. Open ticket. Fix this sprint.
SEV4: cosmetic / minor. Backlog.

Post the definitions somewhere visible. New hires read them.
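One way to keep the levels from drifting is to write them down as data next to your alerting config. The sketch below is a minimal illustration, not a prescribed format: the names `Severity`, `Response`, `POLICY`, and `needs_page` are hypothetical, and the field wording only paraphrases the four levels above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Response:
    page: str    # who gets paged, if anyone
    target: str  # expected fix window


# Paraphrases the four levels listed above; adjust the wording to your team.
POLICY = {
    Severity.SEV1: Response(page="everyone", target="now (war room)"),
    Severity.SEV2: Response(page="on-call", target="hours"),
    Severity.SEV3: Response(page="nobody (open ticket)", target="this sprint"),
    Severity.SEV4: Response(page="nobody (backlog)", target="unscheduled"),
}


def needs_page(sev: Severity) -> bool:
    """SEV1 and SEV2 page a human; SEV3 and SEV4 do not."""
    return sev <= Severity.SEV2
```

Keeping the policy in one lookup table means the paging rule and the published definitions cannot disagree without someone editing the same file.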

The Incident Commander

One human is the IC for the duration. Their job: coordinate, not fix.

Tracks who’s working on what.
Drives comms (status page, customer notifications).
Decides escalations.
Calls “we’re done.”

Without an IC, an incident becomes 5 people doing duplicate work and 10 people watching.

Comms cadence

For SEV1/2:

Status page within 5 minutes.
Internal Slack updates every 15–30 min (“still investigating”, “rolled back”, “monitoring”).
Customer comms within 30 min.
Closing comms when resolved.

Templates pre-written. Nobody composes prose at 3am.
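The cadence above can be turned into concrete deadlines the IC (or a bot) checks against. This is a rough sketch under one assumption the list does not state: that every clock starts when the incident is declared. The function names `comms_deadlines` and `overdue` are made up for illustration.

```python
from datetime import datetime, timedelta


def comms_deadlines(declared_at: datetime, update_every_min: int = 15) -> dict[str, datetime]:
    """Deadlines for a SEV1/2, measured from the declaration time (an assumption)."""
    return {
        "status_page": declared_at + timedelta(minutes=5),
        "first_internal_update": declared_at + timedelta(minutes=update_every_min),
        "customer_comms": declared_at + timedelta(minutes=30),
    }


def overdue(deadlines: dict[str, datetime], sent: set[str], now: datetime) -> list[str]:
    """Names of comms that are past their deadline and have not been sent."""
    return sorted(name for name, due in deadlines.items() if name not in sent and now > due)


declared = datetime(2026, 1, 1, 3, 0)
deadlines = comms_deadlines(declared)
# At 03:20 with only the status page posted, the first internal update is late.
late = overdue(deadlines, sent={"status_page"}, now=datetime(2026, 1, 1, 3, 20))
```

Here `late` is `["first_internal_update"]`: the 15-minute update is missed, while customer comms still has ten minutes left.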

Runbooks

For known failure modes, runbooks exist. Each is one page:

## Postgres replica lag spike

Symptoms: replica lag > 60s; reads inconsistent.

Diagnosis:
1. Check replication slot: SELECT * FROM pg_replication_slots;
2. Check primary load: ...
3. Check network: ...

Mitigation:
- If WAL is filling the disk: drop inactive replication slots.
- If primary is overloaded: fail over.
- If network: ...

Don't:
- Don't just restart replicas — they'll fall further behind.

Runbooks are reviewed and updated after every incident that touched them.

The 5-step incident loop

Detect — alert fires, customer reports, monitoring catches.
Triage — pick IC, declare severity.
Diagnose — what’s wrong?
Mitigate — restore service. Don’t fix root cause yet.
Resolve — verified back to normal.

The order matters. Mitigate before fix. A rollback that restores service in 5 minutes is better than a “real fix” that takes 2 hours. Customers wait while you debug.
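If you track incidents in tooling, the ordering can be enforced rather than remembered. The following is a simplified sketch: `Phase` and `Incident` are hypothetical names, and real incidents sometimes bounce between diagnosis and mitigation, which this strict version deliberately does not model.

```python
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    DIAGNOSE = 3
    MITIGATE = 4
    RESOLVE = 5


class Incident:
    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECT

    def advance(self, to: Phase) -> None:
        """Move to the next phase; skipping ahead (e.g. straight to RESOLVE) is rejected."""
        if to.value != self.phase.value + 1:
            raise ValueError(f"cannot go from {self.phase.name} to {to.name}")
        self.phase = to


incident = Incident("checkout latency")
incident.advance(Phase.TRIAGE)    # IC picked, severity declared
incident.advance(Phase.DIAGNOSE)
incident.advance(Phase.MITIGATE)  # service restored; root-cause fix comes later
```

Calling `incident.advance(Phase.RESOLVE)` before `MITIGATE` raises `ValueError`, which is the point: nobody closes an incident that was never mitigated.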

Postmortem template

Within 5 days of resolution:

## Incident: [Title]

Date: ...
Duration: ...
Severity: ...
Impact: [Specific customers / metrics affected]

## Timeline
- 14:32 alert fires
- 14:35 IC assigned
- 14:40 root cause hypothesized
- 14:55 mitigation applied
- 15:10 verified normal

## Root cause
What actually happened. Technical, specific, blameless.

## Why it took N minutes
Detection: ...
Diagnosis: ...
Mitigation: ...

## What went well
...

## What didn't
...

## Action items
- [Type] [Description] — [Owner] — [Due]
- ...

Action items have owners and due dates, tracked like any other engineering work.
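The "Why it took N minutes" section is easier to fill in if the breakdown is computed from the timeline instead of estimated. A small sketch using the sample timestamps from the template; the key names are illustrative, and it assumes all times fall on the same day. Detection time proper (impact start to alert) is not computed because the sample timeline has no impact-start entry.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Sample timeline from the template above.
timeline = {
    "alert": "14:32",
    "ic_assigned": "14:35",
    "hypothesis": "14:40",
    "mitigated": "14:55",
    "verified": "15:10",
}

breakdown = {
    "triage": minutes_between(timeline["alert"], timeline["ic_assigned"]),        # 3
    "diagnosis": minutes_between(timeline["ic_assigned"], timeline["hypothesis"]),  # 5
    "mitigation": minutes_between(timeline["hypothesis"], timeline["mitigated"]),   # 15
    "verification": minutes_between(timeline["mitigated"], timeline["verified"]),   # 15
}
total = minutes_between(timeline["alert"], timeline["verified"])  # 38
```

In this sample, 30 of the 38 minutes went to applying and verifying the mitigation, so that is where action items would likely pay off most.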

Blameless

The rule: when discussing the incident, no individual is named for “the cause.” We talk about systems, processes, missing safeguards. People made decisions with the info they had; usually a better system would have prevented the bad decision.

Why: blame culture means people hide mistakes. Hidden mistakes don’t get fixed. Blameless culture surfaces failures so they get fixed.

It does NOT mean “no one is responsible.” Responsibility is for getting the action items done, not for “causing” the incident.

Action item rigor

The single most important practice: action items get done.

Each action item has an owner and a due date.
Status reviewed weekly.
Track completion rate as a team metric.
If action items aren’t getting done, surface to leadership.

Postmortems where 80% of action items are stale are theatre.
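Completion rate and staleness are cheap to compute once action items live in a structured list. A minimal sketch, assuming "stale" means open and past its due date (the post does not define it); `ActionItem`, `completion_rate`, and `stale` are hypothetical names, and the sample items are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done_on: Optional[date] = None  # None while the item is still open


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of items that are done; an empty list counts as fully complete."""
    if not items:
        return 1.0
    return sum(1 for item in items if item.done_on is not None) / len(items)


def stale(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has passed (the assumed meaning of 'stale')."""
    return [item for item in items if item.done_on is None and item.due < today]


# Invented sample data for the weekly review.
items = [
    ActionItem("Add replica lag alert", "owner-a", date(2026, 2, 1), done_on=date(2026, 1, 28)),
    ActionItem("Update failover runbook", "owner-b", date(2026, 2, 1)),
]
rate = completion_rate(items)
overdue_items = stale(items, today=date(2026, 2, 10))
```

With this sample data `rate` is `0.5` and `overdue_items` contains only the runbook update, which is the item to raise in the weekly review.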

Cultural cues

Things that signal a healthy incident culture:

“What was the fastest way to mitigate?” gets asked, not “who pushed the button.”
Incident commanders rotate widely; not just one or two heroes.
Postmortems are reviewed by people not on the incident.
Action items finish.
Repeat incidents are rare; lessons stick.

Tools

PagerDuty / Opsgenie / Grafana OnCall for paging.
Status pages: Atlassian Statuspage (Statuspage.io) or self-hosted.
Incident Slack bot: incident.io, FireHydrant, or Rootly to automate IC bookkeeping.
Runbook docs: wherever the team writes (Notion, Slab, internal wiki).
Postmortem repo: every incident archived and searchable.

For the SLO and error-budget side, see SLOs and Error Budgets.

Read this next

If you want my incident-response runbook and postmortem templates, they’re at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
