What's the first move when paged?

Stabilize before debugging. Roll back if there was a recent deploy. Fail over if the issue is stateful. Buy time, then investigate root cause. Customer impact comes first; full understanding can wait.

Logs, metrics, or traces first?

Metrics for shape (what's broken? scope?). Traces for one slow request (where's time spent?). Logs for specifics (what error message?). Use them in that order — top-down narrowing.

Debugging Production Incidents in 2026 — A Senior Engineer's Working Loop

Production incidents are the test that reveals whether your team really knows the system. The patterns that find root cause fast are well-known. This post is the working loop.

Stabilize first

Customer impact > full understanding.

Symptoms: 500 error rate at 5%; was 0.1%.
Recent change: deploy 8 minutes ago.
Action: roll back deploy. THEN investigate.

Buying time:

Recent deploy? Roll back.
Specific tenant? Throttle.
Specific feature? Disable via flag.
Dependency dying? Failover or breaker.
Spike? Scale up.

Stabilize → root cause → fix → restore.
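
The buying-time list above can be written down as a lookup so nobody has to improvise at 3 a.m. This is a minimal sketch: the `Signals` fields and the returned strings are illustrative names, and checking the conditions in the listed order is my assumption, not a rule.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical incident signals; field names are illustrative."""
    recent_deploy: bool = False
    single_tenant: bool = False
    single_feature: bool = False
    dependency_failing: bool = False
    traffic_spike: bool = False


def first_mitigation(s: Signals) -> str:
    """Return the first stabilizing action, checked in the order listed above."""
    if s.recent_deploy:
        return "roll back"
    if s.single_tenant:
        return "throttle tenant"
    if s.single_feature:
        return "disable feature flag"
    if s.dependency_failing:
        return "fail over or open circuit breaker"
    if s.traffic_spike:
        return "scale up"
    return "no obvious lever: keep observing"
```

Several signals can be true at once; the sketch simply returns the first match, so a recent deploy wins over a traffic spike.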

The loop

1. Observe (what's happening?).
2. Hypothesize (what could cause it?).
3. Test (which data confirms / refutes?).
4. Narrow (eliminated → next hypothesis).
5. Fix (when root cause clear).

Each iteration: 5-15 minutes. Don’t get stuck on one hypothesis past 30 min — switch.

Top-down observability

Metrics: 5% 500 errors. Started at 14:32. Concentrated in /api/checkout.
   ↓
Traces: /api/checkout p99 jumped from 200ms → 12s. Time spent in payment.create.
   ↓
Logs (filtered to /api/checkout, level=error, last 10 min): "stripe: timeout".
   ↓
Hypothesis: stripe degraded. Confirm via stripe status page. Yes.

Metrics → traces → logs. Top-down narrows scope before drilling.

Patterns

Recently deployed

Most production issues correlate with a recent change. Always check:

git log --since="2 hours ago" --all
kubectl rollout history deployment/api

Roll back. Investigate after.
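
If deploys, flag flips, and config pushes land in one change feed, the "what changed recently?" question becomes a filter. A rough sketch, assuming a hypothetical list of `(timestamp, description)` tuples; the two-hour lookback mirrors the `git log` window above.

```python
from datetime import datetime, timedelta


def changes_before(
    impact_start: datetime,
    changes: list[tuple[datetime, str]],
    lookback: timedelta = timedelta(hours=2),
) -> list[tuple[datetime, str]]:
    """Return changes inside [impact_start - lookback, impact_start], newest first."""
    window_start = impact_start - lookback
    suspects = [c for c in changes if window_start <= c[0] <= impact_start]
    # Newest first: the change closest to the start of impact is the first suspect.
    return sorted(suspects, key=lambda c: c[0], reverse=True)
```

Changes made after impact started are excluded on purpose; they are usually mitigation attempts, not causes.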

Slow dependency

Latency p99 spiked.
Traces: 80% time in db.query("SELECT ...").
DB metrics: cpu 95%, slow queries up.

Hypothesis: someone added a slow query. Check pg_stat_statements.

Slow dependency cascades to your service.

Cascading failure

A's dependency is slow → A retries → A's pool exhausts → callers of A fail → their callers fail.

Symptoms across many services. Root cause often singular.

Look at the start of impact: which service had the issue first? That’s likely the source.
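
Ordering services by when their errors began is a quick way to apply this. A minimal sketch, assuming you have already pulled a hypothetical map of service name to first-error timestamp from your metrics; clock skew and scrape intervals can blur the order, so treat the result as a ranking of suspects, not a verdict.

```python
from datetime import datetime


def likely_source(first_error_at: dict[str, datetime]) -> list[str]:
    """Order services by when their errors began; the earliest is the first suspect."""
    return sorted(first_error_at, key=lambda service: first_error_at[service])
```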

Resource exhaustion

Memory, CPU, connections, file descriptors. Monitoring catches these — if you have it.

Pod restarts climbing → kubectl describe pod → OOMKilled → Memory limit too low? Memory leak?
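
With many pods, reading `kubectl describe` one by one is slow. The sketch below scans the output of `kubectl get pods -o json` for containers whose last termination reason was `OOMKilled`. It reads only standard pod status fields; the function name is mine.

```python
import json


def oom_killed(pod_list_json: str) -> list[tuple[str, str, int]]:
    """Return (pod, container, restart_count) for containers last terminated by OOMKilled."""
    hits = []
    for pod in json.loads(pod_list_json).get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # lastState.terminated is absent when the container has never restarted.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((pod["metadata"]["name"], cs["name"], cs.get("restartCount", 0)))
    return hits
```

Typical use would be piping `kubectl get pods -n <namespace> -o json` into a script that calls this. A hit tells you the limit was reached, not whether the limit is too low or the process leaks.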

Slow leak

Latency creeps up over hours / days. No clear deploy correlation.

Memory grows linearly. No GC reclaiming. Heap dump shows growing structure.

Find the leak; fix; redeploy. Hard to catch without continuous profiling.
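
Before reaching for a heap dump, a least-squares slope over memory samples can confirm whether growth is steady. A minimal sketch, assuming hypothetical `(hours_since_start, megabytes)` samples exported from your metrics; it says nothing about what is leaking.

```python
def growth_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    return covariance / variance
```

A slope that stays positive across restarts and traffic cycles is a stronger leak signal than any single reading.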

External dependency

Stripe / SendGrid / GitHub status: degraded.
Your service depends on them; the failure cascades.

Always check external statuses early.

Tools and queries

Prometheus / Grafana

# Errors by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)

# Latency p99
histogram_quantile(0.99, sum(rate(http_duration_seconds_bucket[5m])) by (le))

# DB connection pool
pg_pool_used / pg_pool_max

Bookmark these. Don’t write them mid-incident.
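
Bookmarked queries can also live in a small script. The sketch below runs an instant query against the Prometheus HTTP API (`/api/v1/query`) and flattens the result by the `path` label. The base URL you pass in is an assumption about your setup, and the helper names are mine.

```python
import json
import urllib.parse
import urllib.request

# Same query as above, saved so nobody retypes it mid-incident.
ERRORS_BY_PATH = 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)'


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a Prometheus server."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_vector(payload: dict) -> dict[str, float]:
    """Map the 'path' label to its sample value for an instant-vector result."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("path", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def run(base_url: str, promql: str) -> dict[str, float]:
    """Execute the query over HTTP and return path -> value."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=10) as resp:
        return parse_vector(json.load(resp))
```

Calling `run("http://prometheus:9090", ERRORS_BY_PATH)` would return a dict of per-path 5xx rates; the hostname there is a placeholder.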

Logs

# Loki (LogQL); Datadog and Splunk have their own equivalent filter syntax
{namespace="prod", app="api"} |= "ERROR" | json | trace_id != ""

Filter; correlate by trace_id.
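
Once the filtered lines are exported, correlating by trace_id is a group-by. A minimal sketch, assuming one JSON object per log line with a `trace_id` field as in the query above; any other field name is hypothetical.

```python
import json
from collections import defaultdict


def group_by_trace(lines: list[str]) -> dict[str, list[dict]]:
    """Group JSON log lines by trace_id, skipping lines that do not parse or lack one."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text lines are ignored rather than failing the whole run
        if not isinstance(record, dict):
            continue
        trace_id = record.get("trace_id")
        if trace_id:
            grouped[trace_id].append(record)
    return dict(grouped)
```

Each group then reads as the story of one request, which is easier to follow than an interleaved stream.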

Traces

Open Tempo / Jaeger / Datadog APM. Find slow trace; drill into spans.
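
Drilling into spans usually means asking which span has the most self time: its own duration minus its children's. A rough sketch over a hypothetical flat span list; the field names (`id`, `parent_id`, `name`, `duration_ms`) are illustrative, not any tracer's export format, and it assumes child spans run sequentially rather than in parallel.

```python
from collections import defaultdict


def slowest_span_by_self_time(spans: list[dict]) -> tuple[str, float]:
    """Return (name, self_time_ms) of the span that spent the most time in its own work."""
    child_time: dict[str, float] = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_time[span["parent_id"]] += span["duration_ms"]

    def self_time(span: dict) -> float:
        # Time not accounted for by direct children.
        return span["duration_ms"] - child_time[span["id"]]

    worst = max(spans, key=self_time)
    return worst["name"], self_time(worst)
```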

kubectl

kubectl get pods --field-selector=status.phase!=Running
kubectl describe pod <pod>
kubectl logs <pod> --previous
kubectl top pods

Common K8s queries during incidents.

See Kubernetes Debugging.

Mid-incident discipline

Communicate: status page, customer-facing updates, internal Slack.
Designate roles: incident commander, scribe, comms.
Single source of truth: one channel for the incident.
Timestamp everything: notes for postmortem.
Rollback liberally: don’t be a hero.

What to avoid

Multiple changes simultaneously: now you don’t know what worked.
Assuming a hypothesis: every hypothesis needs evidence.
Long debug sessions: 30+ min on one path → step back; new hypotheses.
Reading logs without filters: drowning. Filter aggressively.
Manual edits in prod: don’t kubectl edit; the cluster drifts from your IaC.

Postmortem

After: write up. Blameless.

Timeline: what happened when.
Impact: customer-facing scope.
Root cause: actual; not “human error.”
What went well: existing safeguards that limited blast radius.
What didn’t: gaps in detection / response.
Action items: with owners and deadlines.

See Incident Response 2026.

Building debug muscle

Game days (chaos engineering) practice the loop.
Read postmortems from other companies.
Run incident scenarios in onboarding.
Pair on incidents (junior + senior).

The skill is fast hypothesis switching. Built through practice.

Common mistakes

1. Investigation before mitigation

You’re 30 min into investigating root cause; customers still affected. Mitigate first.

2. Tunnel vision

One hypothesis; refuse to switch. Force yourself to consider 3 alternates.

3. No alerting

Symptom: customer reports outage. You had no idea. Build SLO-based alerts.
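
One common way to make an alert SLO-based is burn rate: the observed error ratio divided by the error budget. A minimal sketch; the 99.9% target in the usage note is an assumed SLO, and the 14.4 threshold is the value often cited from the Google SRE Workbook for a one-hour window on a 30-day SLO, so tune both to your own service.

```python
# Assumed paging threshold; often cited for a 1h window on a 30-day SLO.
PAGE_THRESHOLD = 14.4


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratio: float, slo_target: float) -> bool:
    """Page only when the budget is burning faster than the threshold."""
    return burn_rate(error_ratio, slo_target) >= PAGE_THRESHOLD
```

With the numbers from the stabilize example, a 5% error rate against an assumed 99.9% SLO burns budget roughly 50 times faster than allowed, which pages; the 0.1% baseline burns at about 1 and does not.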

4. Tribal knowledge dependence

Only Alice knows how to debug X. Document; pair; spread knowledge.

5. No follow-up

Postmortem written; action items never done. Track to closure.

What I’d ship today

For incident debugging readiness:

SLO-based alerts that page on real impact.
Standard dashboards per service (errors, latency, deps).
Standard queries documented per service.
Runbooks for top failure modes.
Game days quarterly.
Postmortem template + tracking.
On-call rotation with prep.

Read this next

If you want my incident playbook + queries cheat sheet, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
