Outages happen. The difference between a great team and a struggling one isn’t outage frequency — it’s how they respond, communicate, and learn. This post is the working playbook.

Roles

For anything beyond a 5-minute glitch:

  • Incident Commander (IC) — drives the response. Not the one debugging; the one coordinating. Decides scope, escalates, owns updates.
  • Ops — the engineer(s) actually debugging.
  • Comms — talks to stakeholders, customers, status page.
  • Scribe — keeps a running timeline.

For small teams, one person wears two hats — but call out the roles explicitly.

Severity

Severity | Impact                                  | Response
SEV1     | Major customer impact, revenue at risk  | All hands, IC, page execs
SEV2     | Significant impact, mitigation possible | IC + ops, status page
SEV3     | Minor / single-customer                 | Engineer only, no comms

Define these levels in advance so nobody debates severity during the incident.
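One way to pre-define severity is to keep the levels in code, so the IC looks the response up instead of arguing it out. A minimal sketch: the `classify` function, its three boolean inputs, and the rule order are hypothetical choices for illustration, not a standard.

```python
# Responses copied from the severity levels above.
RESPONSE = {
    "SEV1": ["all hands", "IC", "page execs"],
    "SEV2": ["IC + ops", "status page"],
    "SEV3": ["engineer only", "no comms"],
}


def classify(major_impact: bool, revenue_at_risk: bool, single_customer: bool) -> str:
    """Map coarse impact flags to a severity level (hypothetical rule order)."""
    if major_impact or revenue_at_risk:
        return "SEV1"
    if single_customer:
        return "SEV3"
    # Significant but contained impact falls through to SEV2.
    return "SEV2"


severity = classify(major_impact=False, revenue_at_risk=False, single_customer=False)
print(severity, RESPONSE[severity])
```

The point is less the exact rules than that they are written down before the pager goes off.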

The incident channel

Dedicated channel per incident:

#inc-2026-05-02-checkout-down

All comms in one place. Decisions, hypotheses, links to dashboards, command output. Easy to reconstruct timeline later.
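A consistent channel name is easy to generate. A small sketch that reproduces the naming pattern above; the helper name `incident_channel_name` is hypothetical, and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption.

```python
import re
from datetime import date


def incident_channel_name(day: date, summary: str) -> str:
    """Build '#inc-YYYY-MM-DD-short-slug' from a date and a short summary."""
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


print(incident_channel_name(date(2026, 5, 2), "Checkout down"))
```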

For tooling: incident.io, FireHydrant, PagerDuty Incident, or a Slack workflow + Bolt.

Status updates

Every 15–30 min, even if “still investigating”:

[09:42 UTC] 30% of checkouts failing. Investigating Stripe webhook errors.
            Suspected rate limit. Mitigation: switching to retry queue.
            ETA next update: 09:55.

Stakeholders calm down when they see updates. Without updates, they ping individually, distracting ops.
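The update format is simple enough to template, which also forces every update to commit to a next update time. A minimal sketch, assuming a hypothetical `format_status_update` helper; the 30-minute cap mirrors the cadence above, and the single-line layout is a simplification of the example.

```python
from datetime import datetime, timedelta, timezone

MAX_GAP = timedelta(minutes=30)  # upper bound of the 15-30 min cadence


def format_status_update(now: datetime, summary: str, next_in: timedelta) -> str:
    """Render one timestamped update that commits to the next update time."""
    if next_in > MAX_GAP:
        raise ValueError("next update must be within 30 minutes")
    next_update = now + next_in
    return f"[{now:%H:%M} UTC] {summary} ETA next update: {next_update:%H:%M}."


now = datetime(2026, 5, 2, 9, 42, tzinfo=timezone.utc)
print(format_status_update(now, "30% of checkouts failing.", timedelta(minutes=13)))
```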

Status page

Public status page (Statuspage, Cachet, custom) with:

  • Current incident details.
  • Affected services.
  • Last update timestamp.

Don’t promise resolution times you can’t keep. Say “investigating,” “identified,” “monitoring,” “resolved.” Honest > optimistic.
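The four status words behave like a small state machine. A sketch under my own assumptions: forward moves are always allowed, and the only backward move is from "monitoring" when a fix does not hold. Your status page tool may enforce different rules.

```python
STATES = ["investigating", "identified", "monitoring", "resolved"]


def can_transition(current: str, new: str) -> bool:
    """Allow forward moves, plus a step back from monitoring if the fix fails."""
    if current not in STATES or new not in STATES:
        raise ValueError("unknown status")
    if current == "monitoring" and new in ("investigating", "identified"):
        return True  # assumed: mitigation did not hold, so reopen
    return STATES.index(new) > STATES.index(current)


print(can_transition("investigating", "identified"))
print(can_transition("resolved", "monitoring"))
```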

Runbooks

For predictable failures, runbooks reduce panic:

# Runbook: Stripe Webhook Failures

## Symptoms
- 5xx rate spike on /webhooks/stripe
- Stripe dashboard shows webhook delivery failures

## Quick mitigations
1. Check Stripe API status: https://status.stripe.com
2. Check our queue depth: https://grafana/d/queues
3. If queue backed up, scale workers: `kubectl scale deploy/webhook-worker --replicas=20`

## Diagnosis
1. Check logs: `logcli query '{service="webhook-worker"} |= "stripe"'`
2. ...

## Escalation
- Page #payments-oncall if blocked > 30 min

Linked from alerts. Keeps junior on-callers from cold-starting at 3am.
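"Linked from alerts" is checkable. A rough sketch that lists alerts with no runbook link; the dict shape (a `name` plus an `annotations.runbook_url` field) and the example URL are assumptions for illustration, so adapt them to however your alert rules are actually stored.

```python
def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return names of alerts that carry no runbook link."""
    return [
        a["name"]
        for a in alerts
        if not a.get("annotations", {}).get("runbook_url")
    ]


# Hypothetical alert definitions.
alerts = [
    {
        "name": "StripeWebhook5xx",
        "annotations": {"runbook_url": "https://wiki.example.com/runbooks/stripe-webhooks"},
    },
    {"name": "Checkout5xx", "annotations": {}},
]
print(alerts_missing_runbook(alerts))
```

Run something like this in CI and a new alert cannot ship without a runbook.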

Mitigation > root cause

During an incident, the priority is: stop the bleeding, then diagnose.

Bad: "Let me trace this through the code path to understand..."
Good: "Roll back the deploy. Then we'll investigate."

Mitigations available in seconds:

  • Rollback deploy.
  • Flip kill switch (see Feature Flags).
  • Scale up.
  • Block bad traffic.
  • Failover to standby.

Diagnose afterward, in calmer waters.

Blameless postmortem

Within 5 business days. Format:

# Postmortem: 2026-05-02 Checkout Outage

## Impact
- 32 minutes of degraded checkout (12% failure rate).
- ~1500 affected users.
- $X estimated revenue impact.

## Timeline (UTC)
- 09:32 — alert: 5xx spike
- 09:35 — IC declared, ops investigating
- 09:42 — root cause hypothesized: bad deploy
- 09:48 — rollback initiated
- 10:04 — recovery confirmed

## Root cause
The 09:30 deploy of webhook-worker introduced a bug in retry handling.
Tests passed; integration tests didn't cover this code path.

## What went well
- Alert fired within 2min of impact.
- Rollback completed in under 15min.
- Status page updated promptly.

## What didn't
- The bug should have been caught by integration tests.
- Runbook for "checkout 5xx" didn't exist.

## Action items
- [ ] @alice: add integration test for retry path (P0, 1 week)
- [ ] @bob: write checkout runbook (P1, 2 weeks)
- [ ] @carol: enable canary deploy for webhook-worker (P1, 2 weeks)

Blameless: focus on the system, not the person. “The deploy passed because tests didn’t cover X” — not “Alice’s PR broke prod.”
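The timeline also gives you numbers worth tracking across incidents. A small sketch that derives them from the example postmortem; the helper `minutes_between` is hypothetical and assumes all timestamps fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the example postmortem above.
deploy, alert, ic_declared, recovered = "09:30", "09:32", "09:35", "10:04"

print("deploy -> alert:", minutes_between(deploy, alert))
print("alert -> IC declared:", minutes_between(alert, ic_declared))
print("alert -> recovery:", minutes_between(alert, recovered))
```

The alert-to-recovery figure comes out at 32 minutes, matching the impact line in the example.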

Action items have owners and deadlines

A postmortem with no follow-through is a journaling exercise. Each action item: owner, deadline, priority. Tracked in your normal task system. Reviewed weekly.
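The weekly review can be as simple as listing what is open and past due. A minimal sketch, assuming a hypothetical `ActionItem` record; the due dates below are invented for illustration, and in practice the data would come from your task system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    owner: str
    title: str
    priority: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has already passed."""
    return [i for i in items if not i.done and i.due < today]


# Hypothetical due dates for the example action items.
items = [
    ActionItem("alice", "add integration test for retry path", "P0", date(2026, 5, 14)),
    ActionItem("bob", "write checkout runbook", "P1", date(2026, 5, 21)),
    ActionItem("carol", "enable canary deploy", "P1", date(2026, 5, 21), done=True),
]
for item in overdue(items, today=date(2026, 5, 20)):
    print(f"OVERDUE {item.priority} @{item.owner}: {item.title}")
```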

Practice

Game days: simulate an outage in staging. On-call rotates through scenarios:

  • “Database is unreachable.”
  • “Stripe is returning 500s.”
  • “Cache is full.”

Practice the runbooks, the comms, the IC role. Surfaces gaps before real incidents.
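Rotating people through scenarios can be scripted so nobody always gets the easy one. A tiny round-robin sketch; the engineer names and the `assign_scenarios` helper are hypothetical.

```python
from itertools import cycle


def assign_scenarios(engineers: list[str], scenarios: list[str]) -> list[tuple[str, str]]:
    """Pair each scenario with the next engineer, wrapping around the rotation."""
    return list(zip(cycle(engineers), scenarios))


scenarios = ["Database is unreachable.", "Stripe is returning 500s.", "Cache is full."]
for engineer, scenario in assign_scenarios(["alice", "bob"], scenarios):
    print(f"{engineer}: {scenario}")
```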

On-call hygiene

  • One primary, one backup per rotation.
  • Schedule visibility: PagerDuty / Opsgenie.
  • Compensation (time off, pay) for being on call.
  • Page only if actionable: if there’s no action to take, surface the alert on a dashboard instead of paging.
  • Post-incident on-call retro: was the page noisy? Was a runbook missing?

Burnout from bad on-call kills teams.
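Two of the hygiene rules above are easy to make measurable: where an alert goes, and how noisy the rotation was. A rough sketch with hypothetical helpers; the `actionable` flag is something a human sets during the on-call retro, not something a tool infers.

```python
def route_alert(actionable: bool) -> str:
    """Page only when someone can act; otherwise show it on a dashboard."""
    return "page" if actionable else "dashboard"


def noisy_page_ratio(pages: list[dict]) -> float:
    """Fraction of pages in a rotation that turned out not to be actionable."""
    if not pages:
        return 0.0
    noisy = sum(1 for p in pages if not p["actionable"])
    return noisy / len(pages)


# Hypothetical pages from one rotation, labeled in the retro.
pages = [
    {"alert": "Checkout5xx", "actionable": True},
    {"alert": "DiskUsage70Percent", "actionable": False},
    {"alert": "StripeWebhook5xx", "actionable": True},
    {"alert": "QueueDepthHigh", "actionable": True},
]
print(route_alert(actionable=False), noisy_page_ratio(pages))
```

If the ratio creeps up from one rotation to the next, that is the signal to demote alerts before people start ignoring the pager.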

Tools

  • PagerDuty / Opsgenie: rotations + paging.
  • incident.io / FireHydrant: incident workflow.
  • Statuspage / Cachet: public status.
  • Loki / Datadog / Grafana: logs + metrics.
  • Tempo / Jaeger: distributed tracing.
  • Slack / Discord: comms channel.

For a startup: PagerDuty + Slack channel + Statuspage.io is enough.

Common mistakes

1. No IC named

Three people debugging in parallel; nobody coordinating. Always name an IC.

2. Internal comms only

Customers can’t tell that the outage has been acknowledged. The result is anxiety and support tickets. Keep a public status page.

3. Postmortem skipped

“It’s fixed; let’s move on.” Same outage three months later. Always postmortem SEV1/2.

4. Action items without owners

“We should improve testing.” Who? When? Specific or it doesn’t happen.

5. Blame culture

Engineers afraid to be honest about what they did. Information stops flowing. Blameless framing — always.

Read this next

If you want my incident response template and runbook examples, they’re at rajpoot.dev.

