Chaos engineering separates “we wrote retries” from “our retries actually work.” This post is the practical playbook for 2026.
What chaos engineering is
Intentionally inject failures (kill pods, drop network packets, throttle CPU) and observe. The goal: discover gaps in your resilience before production does.
The five steps of a game day
- Define the steady state. “p95 latency under 500ms; error rate under 0.1%.”
- Hypothesize. “If we kill the cache, latency rises but stays under 1s; error rate stays under 0.5%.”
- Inject the failure.
- Measure.
- Compare to hypothesis. Document gaps.
If reality matches the hypothesis, you have confidence. If not, you have a list of bugs.
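The compare step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: `Bounds`, `p95`, and `find_gaps` are hypothetical names, and the latency samples are assumed to come from whatever metrics system you already run. The thresholds are the ones from the cache hypothesis above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Upper bounds a run must stay under (hypothetical structure)."""
    p95_latency_ms: float
    error_rate: float


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile; integer math avoids float rounding."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def find_gaps(latencies_ms: list[float], errors: int, total: int,
              hypothesis: Bounds) -> list[str]:
    """Return one message per bound the experiment failed to stay under."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    gaps = []
    observed_p95 = p95(latencies_ms)
    observed_error_rate = errors / total
    if observed_p95 >= hypothesis.p95_latency_ms:
        gaps.append(
            f"p95 {observed_p95:.0f}ms, expected under "
            f"{hypothesis.p95_latency_ms:.0f}ms"
        )
    if observed_error_rate >= hypothesis.error_rate:
        gaps.append(
            f"error rate {observed_error_rate:.2%}, expected under "
            f"{hypothesis.error_rate:.2%}"
        )
    return gaps


# Hypothesis from the cache example: p95 under 1s, error rate under 0.5%.
cache_kill = Bounds(p95_latency_ms=1000.0, error_rate=0.005)

# Made-up measurements: 6% of requests were slow, 0.2% failed.
measured = [300.0] * 94 + [1500.0] * 6
for gap in find_gaps(measured, errors=2, total=1000, hypothesis=cache_kill):
    print(gap)
```

An empty list means the hypothesis held. Anything else goes straight into the game day notes.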
What to inject
Start with the obvious. Each is cheap to inject and surfaces real issues:
- Kill a random pod.
- Drop 50% of packets to a dependency.
- Add 1s latency to a database call.
- Fill the disk on a node.
- Expire all sessions / caches.
- Simulate a region outage.
Each tests a different resilience mechanism: retries, circuit breakers, readiness probes, failover.
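Before reaching for a chaos platform, the same idea can be tried in-process. The sketch below wraps a dependency call so that it gains latency and fails some fraction of the time, then checks whether a simple retry loop copes. Everything here (`with_faults`, `call_with_retries`, `InjectedFault`) is a hypothetical helper written for illustration, not part of any chaos tool.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Stands in for a real dependency failure."""


def with_faults(call: Callable[[], T], *, drop_rate: float,
                added_latency_s: float, rng: random.Random,
                sleep: Callable[[float], None] = time.sleep) -> Callable[[], T]:
    """Wrap `call` so it is delayed and fails with probability `drop_rate`."""
    def wrapped() -> T:
        sleep(added_latency_s)            # e.g. 1s added to a database call
        if rng.random() < drop_rate:      # e.g. 0.5 to drop half the calls
            raise InjectedFault("injected drop")
        return call()
    return wrapped


def call_with_retries(call: Callable[[], T], attempts: int) -> T:
    """The mechanism under test: retry on failure, re-raise when exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except InjectedFault:
            if attempt == attempts - 1:
                raise
    raise AssertionError("unreachable")


# Drop 50% of calls to a fake dependency; skip real sleeping in the demo.
flaky = with_faults(lambda: "row", drop_rate=0.5, added_latency_s=1.0,
                    rng=random.Random(7), sleep=lambda _s: None)
try:
    print(call_with_retries(flaky, attempts=3))
except InjectedFault:
    print("retries exhausted")
```

Passing `rng` and `sleep` in as parameters keeps the experiment repeatable and fast in tests, which is a design choice of this sketch rather than a requirement.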
Tools
| Tool | Notes |
|---|---|
| Chaos Mesh | Kubernetes-native; CRD-driven |
| Litmus | Kubernetes; growing |
| Gremlin | Commercial; broad scope |
| AWS Fault Injection Simulator | AWS-native |
| Toxiproxy | Network proxy for chaos |
| Manual + kubectl | Surprisingly effective |
For 2026 K8s shops: Chaos Mesh is the open-source default.
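```yaml
# Chaos Mesh 2.x declares recurring experiments as a Schedule resource.
# (The inline `scheduler.cron` field on PodChaos is 1.x syntax.)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: kill-api
spec:
  schedule: "@every 30m"        # run every 30 minutes
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: random-max-percent    # a random share of pods, up to `value` percent
    value: "25"
    selector:
      namespaces: [api]
```
Kills up to 25% of API pods every 30 minutes. Watch what happens to your SLOs.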
Observability is mandatory
Don’t run chaos without observability. You need to see:
- SLO compliance during the experiment.
- Trace shapes — which dependency’s failures cascaded.
- Per-service error rates.
- Alert delivery and response times.
Without this, you broke things and learned nothing. See OpenTelemetry End-to-End.
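As a rough illustration of the per-service view, the sketch below computes error rates for a baseline window and an experiment window from a list of request records. The `Request` record and `error_rates` function are hypothetical; in practice these numbers would come from your metrics backend rather than an in-memory list.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    ts: float      # seconds since the start of the session
    service: str
    ok: bool


def error_rates(requests: Iterable[Request], start: float,
                end: float) -> dict[str, float]:
    """Error rate per service for requests with start <= ts < end."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for req in requests:
        if start <= req.ts < end:
            totals[req.service] += 1
            if not req.ok:
                errors[req.service] += 1
    return {svc: errors[svc] / totals[svc] for svc in totals}


# Made-up data: failure injected at t=60; checkout degrades, search does not.
log = [
    Request(10, "checkout", True), Request(20, "checkout", True),
    Request(30, "search", True), Request(40, "search", True),
    Request(70, "checkout", False), Request(80, "checkout", True),
    Request(90, "search", True), Request(100, "search", True),
]
baseline = error_rates(log, start=0, end=60)
experiment = error_rates(log, start=60, end=120)
for svc in sorted(experiment):
    print(svc, baseline.get(svc, 0.0), experiment[svc])
```

Comparing the two windows side by side shows which services the failure actually reached, which is the cascade question the trace view answers in more detail.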
The cultural shift
The hard part isn’t the tooling. It’s:
- Permission to break things on purpose. Even in staging.
- Appetite to learn from failures. No blame; understand the gap.
- Postmortems on chaos experiments. Same rigor as real incidents (see Incident Response and Postmortems).
- Action items get done. Otherwise the same bug surfaces every game day.
Light version for small teams
A monthly 2-hour session:
- Pick a service.
- Pick a failure mode.
- Inject in staging.
- Watch dashboards / logs.
- Document what surprised you.
- File tickets for fixes.
That’s it. No platform team, no fancy tooling. The discipline matters more than the tools.
Production chaos
For mature teams: small, controlled chaos in production. Netflix’s Chaos Monkey, later expanded into the Simian Army, was the original. Most companies don’t need it; staging is enough to surface the vast majority of bugs.
If you do: tiny blast radius, gradual ramp, kill switch ready, observability tight.
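One way to picture those four constraints together is a ramp loop that widens the blast radius step by step, checks SLO health after each step, and always rolls back. This is a sketch under assumptions: `ramp` is a hypothetical helper, and `inject`, `slo_healthy`, and `rollback` are callbacks you would wire to your own chaos tool and monitoring.

```python
from typing import Callable, Iterable


def ramp(steps: Iterable[int],
         inject: Callable[[int], None],
         slo_healthy: Callable[[], bool],
         rollback: Callable[[], None]) -> int:
    """Widen the blast radius step by step; stop at the first SLO breach.

    Returns the largest percentage that passed (0 if none did).
    `rollback` is the kill switch and always runs, even if a callback raises.
    """
    last_ok = 0
    try:
        for percent in steps:
            inject(percent)
            if not slo_healthy():
                break
            last_ok = percent
    finally:
        rollback()
    return last_ok


# Demo with fakes: SLOs hold at 1% and 5%, then break at 10%.
injected: list[int] = []
passed = ramp(
    steps=[1, 5, 10, 25],
    inject=injected.append,
    slo_healthy=lambda: injected[-1] < 10,
    rollback=lambda: print("rolled back"),
)
print(passed, injected)
```

Putting the rollback in `finally` means a crash in the experiment code itself still trips the kill switch.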
Common mistakes
1. Chaos without baseline
You don’t know what “normal” looks like. Define steady-state first.
2. No hypothesis
Just breaking things. You learn nothing without a prediction to test.
3. Production-only
Most bugs surface more cheaply in staging. Don’t blow up prod for a test you could run safely elsewhere.
4. No follow-through
Game day reveals 5 bugs; nobody fixes them. Next quarter, same bugs. Pointless.
5. Too aggressive too fast
A new team running a 50% pod kill on day 1 is asking for a real outage. Ramp up gradually.
What I’d ship today
For a team adopting chaos engineering:
- Define SLOs (see SLOs and Error Budgets).
- Wire up tracing + dashboards.
- Pick a service; hypothesize what happens when its dependency fails.
- Inject in staging during business hours.
- Postmortem. Action items.
- Repeat monthly with a different scenario.
- Production chaos only after staging is boring.
Boring path. Effective.
Read this next
- Circuit Breakers, Bulkheads, and Backpressure
- SLOs and Error Budgets
- Incident Response and Postmortems
- OpenTelemetry End-to-End
If you want my chaos engineering runbook and game day templates, they’re at rajpoot.dev.