Production incidents are the test that reveals whether your team really knows the system. The patterns that find root cause fast are well-known. This post is the working loop.
Stabilize first
Customer impact > full understanding.
Symptoms: 500 error rate at 5%; was 0.1%.
Recent change: deploy 8 minutes ago.
Action: roll back deploy. THEN investigate.
Buying time:
- Recent deploy? Roll back.
- Specific tenant? Throttle.
- Specific feature? Disable via flag.
- Dependency dying? Failover or breaker.
- Spike? Scale up.
Stabilize → root cause → fix → restore.
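The "buying time" list reads as a decision order, so here is a minimal sketch of it as code. The `Signals` fields, the 60-minute deploy window, and the action strings are all illustrative assumptions, not a real tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """What the dashboards show in the first minutes (all fields hypothetical)."""
    minutes_since_deploy: Optional[float] = None  # None means no recent deploy
    single_tenant: bool = False       # errors concentrated in one tenant
    single_feature: bool = False      # errors concentrated behind one flag
    dependency_failing: bool = False  # a downstream call is timing out
    traffic_spike: bool = False       # request rate well above normal


def first_mitigation(s: Signals, deploy_window_min: float = 60.0) -> str:
    """Return the first stabilizing action, checked in the same order as the list."""
    if s.minutes_since_deploy is not None and s.minutes_since_deploy <= deploy_window_min:
        return "roll back deploy"
    if s.single_tenant:
        return "throttle tenant"
    if s.single_feature:
        return "disable feature flag"
    if s.dependency_failing:
        return "fail over or open circuit breaker"
    if s.traffic_spike:
        return "scale up"
    return "no obvious lever: escalate and keep observing"
```

With the example above (deploy 8 minutes ago), `first_mitigation(Signals(minutes_since_deploy=8))` returns the rollback action before anything else is considered.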
The loop
1. Observe (what's happening?).
2. Hypothesize (what could cause it?).
3. Test (which data confirms / refutes?).
4. Narrow (eliminated → next hypothesis).
5. Fix (when root cause clear).
Each iteration: 5-15 minutes. Don’t get stuck on one hypothesis past 30 min — switch.
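A rough sketch of the 30-minute timebox as code, in case the scribe wants something mechanical to call it. The `Hypothesis` class and its fields are hypothetical; only the 30-minute cap comes from the text.

```python
from dataclasses import dataclass

MAX_MINUTES_PER_HYPOTHESIS = 30  # the cap from the text


@dataclass
class Hypothesis:
    statement: str
    started_min: float      # minutes since incident start when work on it began
    refuted: bool = False   # set once the data contradicts it


def should_switch(h: Hypothesis, now_min: float) -> bool:
    """True once the hypothesis is refuted or has used up its timebox."""
    timed_out = (now_min - h.started_min) >= MAX_MINUTES_PER_HYPOTHESIS
    return h.refuted or timed_out
```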
Top-down observability
Metrics: 5% 500 errors. Started at 14:32. Concentrated in /api/checkout.
↓
Traces: /api/checkout p99 jumped from 200ms → 12s. Time spent in payment.create.
↓
Logs (filtered to /api/checkout, level=error, last 10 min): "stripe: timeout".
↓
Hypothesis: stripe degraded. Confirm via stripe status page. Yes.
Metrics → traces → logs. Top-down narrows scope before drilling.
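The first narrowing step (which endpoint is the error rate concentrated in?) is just a group-by. A minimal sketch over raw request records, assuming each record is a `(path, status_code)` pair; in practice your metrics backend does this for you.

```python
from collections import Counter


def error_rate_by_path(requests):
    """requests: iterable of (path, status_code). Returns {path: fraction of 5xx}."""
    total, errors = Counter(), Counter()
    for path, status in requests:
        total[path] += 1
        if 500 <= status <= 599:
            errors[path] += 1
    return {path: errors[path] / total[path] for path in total}


def worst_path(requests):
    """Path with the highest 5xx fraction: the place to open traces next."""
    rates = error_rate_by_path(requests)
    return max(rates, key=rates.get)
```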
Patterns
Recently deployed
Most production issues correlate with a recent change. Always check:
git log --since="2 hours ago" --all
kubectl rollout history deployment/api
Roll back. Investigate after.
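To make "recent" concrete, here is a small sketch that lists deploys landing shortly before the incident started. The `(name, timestamp)` input shape and the 2-hour default window are assumptions; feed it whatever your deploy log exports.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploys, incident_start, window=timedelta(hours=2)):
    """deploys: list of (name, datetime).

    Returns deploys that landed within `window` before the incident,
    most recent first.
    """
    hits = [
        (name, at)
        for name, at in deploys
        if incident_start - window <= at <= incident_start
    ]
    return sorted(hits, key=lambda d: d[1], reverse=True)
```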
Slow dependency
Latency p99 spiked.
Traces: 80% time in db.query("SELECT ...").
DB metrics: cpu 95%, slow queries up.
Hypothesis: someone added a slow query. Check pg_stat_statements.
Slow dependency cascades to your service.
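The "80% of time in db.query" number is a share-of-duration calculation over a trace's spans. A minimal sketch, assuming you have exported non-overlapping leaf spans as `(operation, duration_ms)` pairs; tracing UIs normally show this directly.

```python
from collections import defaultdict


def time_share_by_operation(spans):
    """spans: iterable of (operation, duration_ms) for one trace's leaf spans.

    Returns {operation: fraction of total span time}.
    """
    totals = defaultdict(float)
    for operation, duration_ms in spans:
        totals[operation] += duration_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {op: ms / grand_total for op, ms in totals.items()}
```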
Cascading failure
A is slow → A retries → A's pool exhausts → upstream of A fails → the next upstream fails.
Symptoms across many services. Root cause often singular.
Look at the start of impact: which service had the issue first? That’s likely the source.
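"Who failed first" is a sort by first-error timestamp. A tiny sketch, assuming you have already pulled the first error time per service from your metrics (the input shape is hypothetical):

```python
def services_by_first_error(first_error_at):
    """first_error_at: {service: epoch seconds of its first error}.

    Returns service names ordered earliest-first; the head of the list
    is the most likely source, not a proven one.
    """
    return sorted(first_error_at, key=first_error_at.get)
```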
Resource exhaustion
Memory, CPU, connections, file descriptors. Monitoring catches these — if you have it.
Pod restarts climbing → kubectl describe pod → OOMKilled → Memory limit too low? Memory leak?
Slow leak
Latency creeps up over hours / days. No clear deploy correlation.
Memory grows linearly. No GC reclaiming. Heap dump shows growing structure.
Find the leak; fix; redeploy. Hard to catch without continuous profiling.
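Without continuous profiling, a least-squares slope over memory samples is a cheap first check for linear growth. A sketch under the assumption that samples are `(hours_elapsed, mem_mb)` pairs scraped from your metrics; the function names are illustrative.

```python
def memory_slope(samples):
    """samples: list of (hours_elapsed, mem_mb). Least-squares slope in MB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def hours_until_limit(samples, limit_mb):
    """Linear extrapolation from the latest sample; None if memory is not growing."""
    slope = memory_slope(samples)
    if slope <= 0:
        return None
    latest_mb = samples[-1][1]
    return (limit_mb - latest_mb) / slope
```

A steady positive slope across restarts-free hours points at a leak; a sawtooth that keeps returning to the same floor usually does not.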
External dependency
Stripe / SendGrid / GitHub status: degraded.
Your service depends on it; the failure cascades.
Always check external statuses early.
Tools and queries
Prometheus / Grafana
# Errors by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)
# Latency p99
histogram_quantile(0.99, sum(rate(http_duration_seconds_bucket[5m])) by (le))
# DB connection pool
pg_pool_used / pg_pool_max
Bookmark these. Don’t write them mid-incident.
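One way to "bookmark" is to keep the queries in code and generate links. A minimal sketch that builds URLs for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`); the base URL and the `SAVED_QUERIES` names are placeholders, and the metric names are the ones used above.

```python
from urllib.parse import urlencode

# Saved ahead of time so nobody writes PromQL mid-incident.
SAVED_QUERIES = {
    "errors_by_path": 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)',
    "pool_saturation": "pg_pool_used / pg_pool_max",
}


def instant_query_url(base_url, name):
    """Build a GET URL for a saved query against Prometheus' /api/v1/query."""
    query = SAVED_QUERIES[name]
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"
```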
Logs
# Loki (LogQL); Datadog and Splunk have equivalent filters
{namespace="prod", app="api"} |= "ERROR" | json | trace_id != ""
Filter; correlate by trace_id.
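If you end up with an exported batch of JSON log lines, correlating by trace_id is a group-by. A sketch assuming each line is a JSON object with a `trace_id` field and a `msg` field (the `msg` name is an assumption; adjust to your log schema):

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """lines: iterable of JSON log lines. Returns {trace_id: [messages]}.

    Lines that are not JSON objects, or that lack a trace_id, are skipped.
    """
    by_trace = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        trace_id = record.get("trace_id")
        if trace_id:
            by_trace[trace_id].append(record.get("msg", ""))
    return dict(by_trace)
```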
Traces
Open Tempo / Jaeger / Datadog APM. Find slow trace; drill into spans.
kubectl
kubectl get pods --field-selector=status.phase!=Running
kubectl describe pod <pod>
kubectl logs <pod> --previous
kubectl top pods
Common K8s queries during incidents.
See Kubernetes Debugging.
Mid-incident discipline
- Communicate: status page, customer-facing, internal Slack.
- Designate roles: incident commander, scribe, comms.
- Single source of truth: one channel for the incident.
- Timestamp everything: notes for postmortem.
- Rollback liberally: don’t be a hero.
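For "timestamp everything", even a tiny helper beats reconstructing times from memory. A hypothetical sketch of a scribe's timeline that stamps notes in UTC and renders them in order for the postmortem:

```python
from datetime import datetime, timezone


class Timeline:
    """Append-only incident notes, each stamped in UTC."""

    def __init__(self):
        self.entries = []

    def note(self, text, at=None):
        # `at` lets the scribe backfill an event; the default is "now" in UTC.
        at = at or datetime.now(timezone.utc)
        self.entries.append((at, text))

    def render(self):
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(f"{at:%H:%M:%SZ} {text}" for at, text in ordered)
```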
What to avoid
- Multiple changes simultaneously: now you don’t know what worked.
- Assuming a hypothesis: every hypothesis needs evidence.
- Long debug sessions: 30+ min on one path → step back; new hypotheses.
- Reading logs without filters: drowning. Filter aggressively.
- Manual edits in prod: don’t kubectl edit; you get IaC drift.
Postmortem
After: write up. Blameless.
- Timeline: what happened when.
- Impact: customer-facing scope.
- Root cause: actual; not “human error.”
- What went well: existing safeguards that limited blast.
- What didn’t: gaps in detection / response.
- Action items: with owners and deadlines.
See Incident Response 2026.
Building debug muscle
- Game days (chaos engineering) practice the loop.
- Read postmortems from other companies.
- Run incident scenarios in onboarding.
- Pair on incidents (junior + senior).
The skill is fast hypothesis switching. Built through practice.
Common mistakes
1. Investigation before mitigation
You’re 30 min into investigating root cause; customers still affected. Mitigate first.
2. Tunnel vision
One hypothesis; refuse to switch. Force yourself to consider 3 alternates.
3. No alerting
Symptom: customer reports outage. You had no idea. Build SLO-based alerts.
4. Tribal knowledge dependence
Only Alice knows how to debug X. Document; pair; spread knowledge.
5. No follow-up
Postmortem written; action items never done. Track to closure.
What I’d ship today
For incident debugging readiness:
- SLO-based alerts that page on real impact.
- Standard dashboards per service (errors, latency, deps).
- Standard queries documented per service.
- Runbooks for top failure modes.
- Game days quarterly.
- Postmortem template + tracking.
- On-call rotation with prep.
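For the SLO-based alerts item, the core arithmetic is the burn rate: observed error rate divided by the error budget. A minimal sketch; the 99.9% target in the usage note is an assumed SLO, and the 14.4 default is the commonly cited fast-burn paging threshold (about 2% of a 30-day budget consumed in one hour), not something this post prescribes.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent.

    error_rate and slo_target are fractions, e.g. 0.05 and 0.999.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_page(error_rate, slo_target, threshold=14.4):
    """Page only when the budget is burning fast enough to matter."""
    return burn_rate(error_rate, slo_target) >= threshold
```

With an assumed 99.9% SLO, the 5% error rate from the opening example burns budget roughly 50 times too fast and pages; the 0.1% baseline burns at about 1x and does not.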
Read this next
If you want my incident playbook + queries cheat sheet, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.