Chaos engineering separates “we wrote retries” from “our retries actually work.” This post is the practical playbook for 2026.

What chaos engineering is

Intentionally inject failures (kill pods, drop network packets, throttle CPU) and observe. The goal: discover gaps in your resilience before production does.

The five steps of a game day

  1. Define the steady state. “p95 latency under 500ms; error rate under 0.1%.”
  2. Hypothesize. “If we kill the cache, latency rises but stays under 1s; error rate stays under 0.5%.”
  3. Inject the failure.
  4. Measure.
  5. Compare to hypothesis. Document gaps.

If reality matches hypothesis: you have confidence. If not: you have a list of bugs.
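Steps 4 and 5 are easy to automate. Here is a minimal sketch of the comparison, assuming you can export request latencies and an error count for the injection window. The `Thresholds` class, the `evaluate` helper, and the sample numbers are all made up for illustration; they are not from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    p95_ms: float      # p95 latency must stay under this many milliseconds
    error_rate: float  # fraction of failed requests must stay under this


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def evaluate(latencies_ms, errors, total, limits):
    """Return a list of gaps; an empty list means the hypothesis held."""
    gaps = []
    observed_p95 = p95(latencies_ms)
    observed_err = errors / total
    if observed_p95 >= limits.p95_ms:
        gaps.append(f"p95 {observed_p95:.0f}ms is not under {limits.p95_ms:.0f}ms")
    if observed_err >= limits.error_rate:
        gaps.append(f"error rate {observed_err:.2%} is not under {limits.error_rate:.2%}")
    return gaps


# Steady state from step 1 and hypothesis from step 2.
steady_state = Thresholds(p95_ms=500, error_rate=0.001)
hypothesis = Thresholds(p95_ms=1000, error_rate=0.005)

# Made-up measurements taken while the cache was down.
latencies = [120] * 90 + [800] * 6 + [1400] * 4
gaps = evaluate(latencies, errors=3, total=1000, limits=hypothesis)
print(gaps or "hypothesis held")
```

Running the same `evaluate` call against `steady_state` before the injection gives you the baseline to compare with.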

What to inject

Start with the obvious. Each is cheap to inject and surfaces real issues:

  • Kill a random pod.
  • Drop 50% of packets to a dependency.
  • Add 1s latency to a database call.
  • Fill the disk on a node.
  • Expire all sessions / caches.
  • Simulate a region outage.

Each tests a different resilience mechanism: retries, circuit breakers, readiness probes, failover.
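You can rehearse the packet-drop case in-process before touching the network. The sketch below is an assumption-heavy toy: `flaky`, `with_retries`, and `fetch_profile` are hypothetical names, and the wrapper simulates loss with a seeded random generator rather than dropping real packets. It shows why retries alone do not make a 50% loss rate disappear.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dropped request."""


def flaky(call, drop_rate, rng):
    """Wrap `call` so that roughly `drop_rate` of invocations fail."""
    def wrapped(*args, **kwargs):
        if rng.random() < drop_rate:
            raise InjectedFault("simulated packet loss")
        return call(*args, **kwargs)
    return wrapped


def with_retries(call, attempts):
    """Call `call` up to `attempts` times (attempts >= 1); re-raise the last fault."""
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except InjectedFault as exc:
            last_error = exc
    raise last_error


def fetch_profile():
    # Hypothetical dependency call standing in for a real client.
    return {"user": "demo"}


rng = random.Random(42)  # seeded so a run is reproducible
dependency = flaky(fetch_profile, drop_rate=0.5, rng=rng)

failures = 0
for _ in range(1000):
    try:
        with_retries(dependency, attempts=3)
    except InjectedFault:
        failures += 1

# With 50% loss and 3 attempts, about 0.5 ** 3 = 12.5% of requests should still fail.
print(f"failed after retries: {failures / 1000:.1%}")
```

If your hypothesis said "error rate stays under 0.5%", this is the kind of gap a game day is meant to surface.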

Tools

Tool | Type
Chaos Mesh | Kubernetes-native; CRD-driven
Litmus | Kubernetes; growing
Gremlin | Commercial; broad scope
AWS Fault Injection Simulator | AWS-native
Toxiproxy | Network proxy for chaos
Manual + kubectl | Surprisingly effective

For 2026 K8s shops: Chaos Mesh is the default open-source choice.

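# Chaos Mesh 2.x removed the inline `scheduler` field; recurring runs use a Schedule.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata: { name: kill-api }
spec:
  schedule: "@every 30m"       # run the experiment every 30 minutes
  type: PodChaos
  concurrencyPolicy: Forbid    # skip a run if the previous one is still active
  historyLimit: 5              # keep the last 5 runs for inspection
  podChaos:
    action: pod-kill
    mode: random-max-percent   # a random share of matching pods, at most `value` percent
    value: "25"
    selector:
      namespaces: [api]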

Kills up to 25% of the pods in the api namespace every 30 minutes (random-max-percent picks a random share, capped at the value). Watch what happens to your SLOs.

Observability is mandatory

Don’t run chaos without observability. You need to see:

  • SLO compliance during the experiment.
  • Trace shapes — which dependency’s failures cascaded.
  • Per-service error rates.
  • Alert delivery and response times.

Without this, you broke things and learned nothing. See OpenTelemetry End-to-End.
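If your dashboards can't slice by experiment window yet, even a throwaway script over exported request logs helps. A rough sketch, assuming each record is a `(timestamp, service, succeeded)` tuple; the record shape, the `error_rates` helper, and the sample data are invented for this example.

```python
from collections import defaultdict


def error_rates(requests, start, end):
    """Per-service error rate for requests whose timestamp is in [start, end)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, service, ok in requests:
        if start <= ts < end:
            totals[service] += 1
            if not ok:
                errors[service] += 1
    return {svc: errors[svc] / totals[svc] for svc in totals}


# Made-up records: (unix timestamp, service, succeeded?)
records = [
    (100, "api", True), (101, "api", False), (102, "cache", False),
    (103, "api", True), (104, "cache", False), (200, "api", True),
]

# Only look at the window in which the fault was active.
rates = error_rates(records, start=100, end=105)
for service, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service}: {rate:.0%}")
```

Sorting by rate puts the service that took the hit first, which is usually where the cascade started.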

The cultural shift

The hard part isn’t the tooling. It’s:

  • Permission to break things on purpose. Even in staging.
  • Appetite to learn from failures. No blame; understand the gap.
  • Postmortems on chaos experiments. Same rigor as real incidents; see Incident Response and Postmortems.
  • Action items get done. Otherwise the same bug surfaces every game day.

Light version for small teams

A monthly 2-hour session:

  1. Pick a service.
  2. Pick a failure mode.
  3. Inject in staging.
  4. Watch dashboards / logs.
  5. Document what surprised you.
  6. File tickets for fixes.

That’s it. No platform team, no fancy tooling. The discipline matters more than the tools.

Production chaos

For mature teams: small, controlled chaos in production. Netflix’s Chaos Monkey, later expanded into the Simian Army, was the original. Most companies don’t need it; staging is enough for the vast majority of bugs.

If you do: tiny blast radius, gradual ramp, kill switch ready, observability tight.
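Those four guardrails fit in a few lines of control logic. This is a sketch only: `ramp`, `inject`, and `healthy` are hypothetical names, and the in-memory `state` dict stands in for a real chaos tool and a real SLO check.

```python
def ramp(stages, inject, healthy):
    """Widen the blast radius stage by stage; stop at the first unhealthy check.

    stages  - increasing percentages of targets to affect, e.g. [1, 5, 10]
    inject  - callable(percent) that applies the fault at that radius
    healthy - callable() -> bool, the kill-switch condition (e.g. SLOs still met)
    Returns the stages that completed while the system stayed healthy.
    """
    completed = []
    for percent in stages:
        inject(percent)
        if not healthy():
            break  # kill switch tripped: do not widen any further
        completed.append(percent)
    inject(0)  # always roll back, whether we aborted or finished
    return completed


state = {"percent": 0}


def inject(percent):
    # Hypothetical stand-in for a call into your chaos tool.
    state["percent"] = percent


def healthy():
    # Pretend the system copes with faults on up to 5% of targets.
    return state["percent"] <= 5


survived = ramp([1, 5, 10, 25], inject, healthy)
print(f"stages survived: {survived}, fault now at {state['percent']}%")
```

The point of the design is that the abort path and the normal exit share the same rollback, so there is no way to leave the fault running.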

Common mistakes

1. Chaos without baseline

You don’t know what “normal” looks like. Define the steady state first.

2. No hypothesis

Just breaking things. You learn nothing without a prediction to test.

3. Production-only

Most bugs surface more cheaply in staging. Don’t blow up prod for a test you could run safely elsewhere.

4. No follow-through

Game day reveals 5 bugs; nobody fixes them. Next quarter, same bugs. Pointless.

5. Too aggressive too fast

A new team running 50% pod kill on day 1 is asking for a real outage. Ramp.

What I’d ship today

For a team adopting chaos engineering:

  1. Define SLOs (see SLOs and Error Budgets).
  2. Wire up tracing + dashboards.
  3. Pick a service; hypothesize what happens when its dependency fails.
  4. Inject in staging during business hours.
  5. Postmortem. Action items.
  6. Repeat monthly with a different scenario.
  7. Production chaos only after staging is boring.

Boring path. Effective.

Read this next

If you want my chaos engineering runbook + game day templates, it’s at rajpoot.dev.

