Production incidents are the test that reveals whether your team really knows the system. The patterns that find root cause fast are well known. This post lays out the working loop.

Stabilize first

Customer impact > full understanding.

Symptoms: 500 error rate at 5%; was 0.1%.
Recent change: deploy 8 minutes ago.
Action: roll back deploy. THEN investigate.

Buying time:

  • Recent deploy? Roll back.
  • Specific tenant? Throttle.
  • Specific feature? Disable via flag.
  • Dependency dying? Failover or breaker.
  • Spike? Scale up.

Stabilize → root cause → fix → restore.
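The "buying time" list reads naturally as a lookup table. Here is a minimal sketch of that idea; the `Signal` fields and the `first_mitigation` helper are hypothetical names for illustration, and the ordering simply mirrors the list above rather than any official priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """What you know in the first minutes of an incident (hypothetical fields)."""
    recent_deploy: bool = False
    single_tenant: bool = False
    single_feature: bool = False
    dependency_failing: bool = False
    traffic_spike: bool = False


def first_mitigation(signal: Signal) -> str:
    """Return the first stabilizing action that matches, in the list's order."""
    checks = [
        (signal.recent_deploy, "roll back the deploy"),
        (signal.single_tenant, "throttle the tenant"),
        (signal.single_feature, "disable the feature via flag"),
        (signal.dependency_failing, "fail over or open the circuit breaker"),
        (signal.traffic_spike, "scale up"),
    ]
    for matched, action in checks:
        if matched:
            return action
    return "no quick mitigation matched; keep observing"


# A deploy 8 minutes ago wins over everything else on the list.
print(first_mitigation(Signal(recent_deploy=True, traffic_spike=True)))
```

The point of encoding it is not automation; it is that the decision is made before the incident, not during it.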

The loop

1. Observe (what's happening?).
2. Hypothesize (what could cause it?).
3. Test (which data confirms / refutes?).
4. Narrow (eliminated → next hypothesis).
5. Fix (when root cause clear).

Each iteration: 5-15 minutes. Don’t get stuck on one hypothesis past 30 min — switch.
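One way to enforce the 30-minute rule is to have the scribe record when each hypothesis started. A rough sketch, assuming a hypothetical `Hypothesis` record; only the 30-minute time box comes from the text above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

TIME_BOX = timedelta(minutes=30)  # the limit suggested above


@dataclass
class Hypothesis:
    text: str
    started: datetime
    evidence: list = field(default_factory=list)  # notes that confirm or refute

    def should_switch(self, now: datetime) -> bool:
        """True once this hypothesis has consumed more than the time box."""
        return now - self.started > TIME_BOX


h = Hypothesis("stripe degraded", started=datetime(2024, 1, 1, 14, 35))
h.evidence.append("logs show 'stripe: timeout'")
print(h.should_switch(datetime(2024, 1, 1, 14, 45)))  # 10 minutes in
print(h.should_switch(datetime(2024, 1, 1, 15, 6)))   # 31 minutes in
```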

Top-down observability

Metrics: 5% 500 errors. Started at 14:32. Concentrated in /api/checkout.
Traces: /api/checkout p99 jumped from 200ms → 12s. Time spent in payment.create.
Logs (filtered to /api/checkout, level=error, last 10 min): "stripe: timeout".
Hypothesis: Stripe is degraded. Confirm via the Stripe status page. Confirmed.

Metrics → traces → logs. Top-down narrows scope before drilling.

Patterns

Recently deployed

Most production issues correlate with a recent change. Always check:

git log --since="2 hours ago" --all
kubectl rollout history deployment/api

Roll back. Investigate after.
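Once you have the deploy history, the correlation itself is a simple time-window filter. A minimal sketch, assuming you have already collected `(name, finished_at)` pairs from your deploy tooling; the `suspect_deploys` helper, the dates, and the service names are made up for illustration.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploys, incident_start, window=timedelta(hours=2)):
    """Return deploys that finished within `window` before the incident, newest first."""
    suspects = [
        (name, at)
        for name, at in deploys
        if incident_start - window <= at <= incident_start
    ]
    return sorted(suspects, key=lambda d: d[1], reverse=True)


incident_start = datetime(2024, 1, 1, 14, 32)
deploys = [
    ("api v41", datetime(2024, 1, 1, 9, 0)),
    ("api v42", datetime(2024, 1, 1, 14, 24)),   # 8 minutes before impact
    ("worker v7", datetime(2024, 1, 1, 13, 10)),
]
for name, at in suspect_deploys(deploys, incident_start):
    print(name, at.isoformat())
```

The newest deploy inside the window is the first rollback candidate.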

Slow dependency

Latency p99 spiked.
Traces: 80% time in db.query("SELECT ...").
DB metrics: cpu 95%, slow queries up.

Hypothesis: someone added a slow query. Check pg_stat_statements.

A slow dependency cascades to your service.
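The "80% of time in db.query" reading can be reproduced from exported span data. A small sketch, assuming spans arrive as `(operation, duration_ms)` pairs of non-overlapping self-time; the `time_share` helper and the sample numbers are hypothetical.

```python
from collections import defaultdict


def time_share(spans):
    """Return {operation: fraction of total time}, largest share first."""
    totals = defaultdict(float)
    for operation, duration_ms in spans:
        totals[operation] += duration_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {operation: ms / grand_total for operation, ms in ranked}


spans = [("db.query", 500.0), ("render", 120.0), ("db.query", 300.0), ("auth", 80.0)]
shares = time_share(spans)
for operation, share in shares.items():
    print(f"{operation}: {share:.0%}")
```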

Cascading failure

A's dependency is slow → A retries → A's pool exhausts → callers of A fail → their callers fail.

Symptoms across many services. Root cause often singular.

Look at the start of impact: which service had issues first? That’s likely the source.
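Finding "who broke first" is an ordering problem over per-service error rates. A rough sketch, assuming you can export each service's error-rate series as time-sorted `(timestamp, rate)` pairs; the `first_to_degrade` helper, the 1% threshold, and the sample data are assumptions.

```python
def first_to_degrade(error_rates, threshold=0.01):
    """Order services by the time their error rate first exceeded `threshold`.

    error_rates: {service: [(timestamp, rate), ...]} with each series sorted by time.
    Services that never cross the threshold are left out.
    """
    first_breach = {}
    for service, series in error_rates.items():
        for timestamp, rate in series:
            if rate > threshold:
                first_breach[service] = timestamp
                break
    return sorted(first_breach.items(), key=lambda kv: kv[1])


# Timestamps are minutes since the start of the graph (illustrative data).
rates = {
    "checkout": [(0, 0.001), (1, 0.002), (2, 0.04)],
    "payments": [(0, 0.001), (1, 0.06), (2, 0.09)],
    "search": [(0, 0.001), (1, 0.001), (2, 0.001)],
}
order = first_to_degrade(rates)
print(order)
```

The earliest breach is a lead, not proof; confirm it against traces before acting.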

Resource exhaustion

Memory, CPU, connections, file descriptors. Monitoring catches these — if you have it.

Pod restarts climbing → kubectl describe pod → OOMKilled → Memory limit too low? Memory leak?
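Across many pods, the same check can run against the output of `kubectl get pods -o json`. A minimal sketch that reads the `lastState.terminated.reason` field from each container status; the `oom_killed` helper and the sample pod are hypothetical, and the JSON is trimmed to the fields used.

```python
import json


def oom_killed(pods_json: str):
    """Return (pod, container, restart_count) for containers whose last exit was OOMKilled."""
    hits = []
    for pod in json.loads(pods_json).get("items", []):
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append(
                    (pod["metadata"]["name"], status["name"], status.get("restartCount", 0))
                )
    return hits


# Trimmed stand-in for `kubectl get pods -o json` output.
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "api-7d9f"},
            "status": {"containerStatuses": [
                {"name": "api", "restartCount": 6,
                 "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            ]},
        },
        {
            "metadata": {"name": "worker-1a2b"},
            "status": {"containerStatuses": [
                {"name": "worker", "restartCount": 0, "lastState": {}},
            ]},
        },
    ]
})
print(oom_killed(sample))
```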

Slow leak

Latency creeps up over hours / days. No clear deploy correlation.

Memory grows linearly. No GC reclaiming. Heap dump shows growing structure.

Find the leak; fix; redeploy. Hard to catch without continuous profiling.
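"Memory grows linearly" can be checked with a least-squares fit over memory samples. A sketch under stated assumptions: samples are `(hours, megabytes)` pairs, growth is roughly linear, and `slope_per_hour` and `hours_until_limit` are hypothetical helpers, not part of any monitoring library.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov / var_x


def hours_until_limit(samples, limit_mb):
    """Rough time from the latest sample until limit_mb; None if memory is not growing."""
    rate = slope_per_hour(samples)
    if rate <= 0:
        return None
    _, latest_mb = samples[-1]
    return (limit_mb - latest_mb) / rate


samples = [(0, 500), (1, 520), (2, 540), (3, 560)]  # illustrative readings
print(slope_per_hour(samples), "MB/hour")
print(hours_until_limit(samples, limit_mb=1000), "hours until the limit")
```

A steady positive slope with no plateau is the signal; the projection only tells you how urgent the fix is.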

External dependency

Stripe / SendGrid / GitHub status: degraded.
Your service depends on it, so the failure cascades to you.

Always check external statuses early.

Tools and queries

Prometheus / Grafana

# Errors by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)

# Latency p99
histogram_quantile(0.99, sum(rate(http_duration_seconds_bucket[5m])) by (le))

# DB connection pool
pg_pool_used / pg_pool_max

Bookmark these. Don’t write them mid-incident.
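"Bookmarking" can be as simple as a checked-in file of named queries that expands into URLs for the Prometheus HTTP API (`/api/v1/query`). A minimal sketch; the host name, the `SAVED_QUERIES` names, and the `instant_query_url` helper are assumptions, and the metric names are the ones used in this post.

```python
from urllib.parse import parse_qs, urlencode, urlparse

PROMETHEUS = "http://prometheus.example.internal:9090"  # hypothetical host

SAVED_QUERIES = {
    "errors_by_path": 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)',
    "pool_saturation": "pg_pool_used / pg_pool_max",
}


def instant_query_url(name: str) -> str:
    """Build an instant-query URL for a saved query, with the PromQL URL-encoded."""
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': SAVED_QUERIES[name]})}"


url = instant_query_url("errors_by_path")
print(url)

# Round-trip check: decoding the URL gives back the original PromQL.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == SAVED_QUERIES["errors_by_path"])
```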

Logs

# Loki (LogQL); Datadog and Splunk have equivalent filters
{namespace="prod", app="api"} |= "ERROR" | json | trace_id != ""

Filter; correlate by trace_id.
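If you export matching lines, correlating by trace_id is a grouping step. A short sketch, assuming structured JSON logs with `level`, `trace_id`, and `msg` fields; those field names and the `errors_by_trace` helper are assumptions about your log schema.

```python
import json
from collections import defaultdict


def errors_by_trace(lines):
    """Group error messages by trace_id, skipping lines that are not JSON objects."""
    grouped = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text line, ignore
        if not isinstance(record, dict):
            continue
        if record.get("level") == "error" and record.get("trace_id"):
            grouped[record["trace_id"]].append(record.get("msg", ""))
    return dict(grouped)


lines = [
    '{"level": "error", "trace_id": "abc", "msg": "stripe: timeout"}',
    '{"level": "info", "trace_id": "abc", "msg": "request started"}',
    '{"level": "error", "trace_id": "abc", "msg": "payment.create failed"}',
    '{"level": "error", "trace_id": "", "msg": "no trace attached"}',
    "not json at all",
]
print(errors_by_trace(lines))
```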

Traces

Open Tempo / Jaeger / Datadog APM. Find slow trace; drill into spans.

kubectl

kubectl get pods --field-selector=status.phase!=Running
kubectl describe pod <pod>
kubectl logs <pod> --previous
kubectl top pods

Common K8s queries during incidents.

See Kubernetes Debugging.

Mid-incident discipline

  • Communicate: status page, customer-facing, internal Slack.
  • Designate roles: incident commander, scribe, comms.
  • Single source of truth: one channel for the incident.
  • Timestamp everything: notes for postmortem.
  • Roll back liberally: don’t be a hero.

What to avoid

  • Multiple changes simultaneously: now you don’t know what worked.
  • Assuming a hypothesis: every hypothesis needs evidence.
  • Long debug sessions: 30+ min on one path → step back; new hypotheses.
  • Reading logs without filters: drowning. Filter aggressively.
  • Manual edits in prod: don’t kubectl edit; the cluster drifts from your IaC.

Postmortem

After: write up. Blameless.

  • Timeline: what happened when.
  • Impact: customer-facing scope.
  • Root cause: actual; not “human error.”
  • What went well: existing safeguards that limited the blast radius.
  • What didn’t: gaps in detection / response.
  • Action items: with owners and deadlines.

See Incident Response 2026.

Building debug muscle

  • Game days (chaos engineering) practice the loop.
  • Read postmortems from other companies.
  • Run incident scenarios in onboarding.
  • Pair on incidents (junior + senior).

The skill is fast hypothesis switching. Built through practice.

Common mistakes

1. Investigation before mitigation

You’re 30 min into investigating root cause; customers still affected. Mitigate first.

2. Tunnel vision

One hypothesis, and you refuse to switch. Force yourself to consider three alternatives.

3. No alerting

Symptom: customer reports outage. You had no idea. Build SLO-based alerts.
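An SLO-based alert is usually expressed as a burn rate: how fast you are spending the error budget relative to plan. A minimal sketch, assuming a 99.9% SLO and the commonly cited 14.4x paging threshold (roughly 2% of a 30-day budget consumed in one hour); `burn_rate`, `should_page`, and the two-window check are illustrative, not a specific alerting product's API.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(short_window_ratio, long_window_ratio, slo_target=0.999, threshold=14.4):
    """Page only when both the short and the long window burn faster than `threshold`."""
    return (
        burn_rate(short_window_ratio, slo_target) >= threshold
        and burn_rate(long_window_ratio, slo_target) >= threshold
    )


# The opening example: 5% errors against a 0.1% baseline.
print(should_page(0.05, 0.05))    # sustained 5% errors
print(should_page(0.001, 0.001))  # baseline traffic
print(should_page(0.05, 0.001))   # brief blip, long window still healthy
```

Requiring both windows keeps a short blip from paging while still catching a sustained burn quickly.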

4. Tribal knowledge dependence

Only Alice knows how to debug X. Document; pair; spread knowledge.

5. No follow-up

Postmortem written; action items never done. Track to closure.

What I’d ship today

For incident debugging readiness:

  • SLO-based alerts that page on real impact.
  • Standard dashboards per service (errors, latency, deps).
  • Standard queries documented per service.
  • Runbooks for top failure modes.
  • Game days quarterly.
  • Postmortem template + tracking.
  • On-call rotation with prep.

Read this next

If you want my incident playbook + queries cheat sheet, it’s at rajpoot.dev.
