A circuit breaker stops your service from hammering a dying dependency. Done right, your failures stay isolated. Done wrong, the breaker itself becomes the bug. This post is the working playbook.
The pattern
[Closed] ── too many failures ──▶ [Open] ── timeout ──▶ [Half-Open]
   ▲                                                         │
   └──────── successful test request ────────────────────────┤
                                                             │
                                                             └── failure ──▶ [Open]
- Closed: normal — calls go through.
- Open: failures exceeded threshold — fail fast, don’t even try.
- Half-Open: timeout expired — allow one trial; if good, close; if bad, re-open.
Why it matters
Without a breaker:
- Stripe is degraded; latency goes from 100ms to 30s.
- Your service still calls Stripe.
- All your threads/coroutines are stuck waiting on Stripe.
- New requests pile up; pool exhausts.
- Your service is now “down” too — even for non-Stripe paths.
With a breaker: after N failures, calls to Stripe fail fast, so threads stay free and non-Stripe paths keep working.
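To put rough numbers on that pile-up: by Little's law, average in-flight calls equal arrival rate times latency. The sketch below uses the 100ms and 30s latencies from the list above; the 50 requests/second and the 200-slot pool are assumptions for illustration, not measurements.

```python
def in_flight(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrent requests = arrival rate x latency."""
    return requests_per_second * latency_seconds


RATE = 50        # assumed: requests/second that touch Stripe
POOL_SIZE = 200  # assumed: worker/connection pool size

healthy = in_flight(RATE, 0.1)    # 100 ms latency
degraded = in_flight(RATE, 30.0)  # 30 s latency

print(f"healthy:  {healthy:.0f} in flight")
print(f"degraded: {degraded:.0f} in flight, pool exhausted: {degraded > POOL_SIZE}")
```

Under those assumptions you go from about 5 concurrent Stripe calls to 1,500, far past the pool size.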
Implementation
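import time
from enum import Enum

class State(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3

class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without trying it."""

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_timeout=30, half_open_calls=1):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.half_open_calls = half_open_calls  # max concurrent trial calls
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.trials = 0  # trial calls currently in flight

    async def call(self, fn, *args, **kwargs):
        if self.state == State.OPEN:
            # monotonic() is not affected by wall-clock adjustments
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpen()
            self.state = State.HALF_OPEN
        trial = self.state == State.HALF_OPEN
        if trial:
            if self.trials >= self.half_open_calls:
                raise CircuitOpen()  # all trial slots are taken
            self.trials += 1
        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        else:
            self._on_success()
            return result
        finally:
            if trial:
                self.trials -= 1  # also runs if the trial is cancelled

    def _on_failure(self):
        self.failures += 1
        # A failed trial re-opens at once; otherwise wait for the threshold.
        if self.state == State.HALF_OPEN or self.failures >= self.fail_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED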
Thresholds
| Setting | Typical | Notes |
|---|---|---|
| Failure threshold | 5–10 | Too low = flaps; too high = slow to react |
| Reset timeout | 10–60s | How long open before retry |
| Failure rate window | 10–30s | Use a rolling window, not all-time count |
| Half-open trials | 1–3 | Don’t flood the recovering dep |
For external dependencies: rolling-window failure rate (e.g., “50% in last 30s”) + minimum volume (e.g., “at least 20 calls”) is more robust than fixed counts.
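A minimal sketch of that rolling-window rule, assuming the 50% / 30s / 20-call numbers from the sentence above. The class name, its methods, and the injectable `clock` are mine, not from any library; you would call `record()` after each call and check `should_trip()` in place of the fixed failure count.

```python
import time
from collections import deque


class RollingFailureRate:
    """Tracks call outcomes over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=30.0, rate_threshold=0.5,
                 min_calls=20, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls
        self.clock = clock      # injectable so tests can control time
        self.events = deque()   # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded: bool) -> None:
        self.events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self) -> None:
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_trip(self) -> bool:
        self._evict()
        total = len(self.events)
        if total < self.min_calls:
            return False  # not enough traffic to judge
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.rate_threshold
```

The minimum-volume check is what stops a quiet service from tripping on two failures out of three calls.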
Fallback
async def get_recommendations(user_id):
    try:
        return await breaker.call(rec_service.get, user_id)
    except CircuitOpen:
        return cached_default_recs()
When the breaker opens: degrade gracefully. Cached results, simpler heuristics, or “we’ll show you something else.”
If there’s no fallback worth showing: at least show a clean error rather than a 30s timeout.
Per-host breakers
For services calling many backends, one breaker per host:
breakers: dict[str, CircuitBreaker] = {}

async def call(host, fn, *args):
    # Look up first so we don't build a throwaway breaker on every call.
    b = breakers.get(host)
    if b is None:
        b = breakers[host] = CircuitBreaker()
    return await b.call(fn, *args)
One backend going bad doesn’t cut off others.
Libraries
| Library | Lang | Strengths |
|---|---|---|
| resilience4j | Java | Comprehensive (breaker, retry, bulkhead, rate limit) |
| Polly | .NET | Mature; rich policies |
| gobreaker | Go | Simple, classic CB |
| pybreaker | Python | Lightweight |
| opossum | Node | Async-friendly |
| Hystrix | Java | Maintenance-only; new code → resilience4j |
Roll your own only if you need very specific behavior.
Combine with retries
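import httpx
from tenacity import retry, retry_if_not_exception_type, stop_after_attempt, wait_exponential

# Retry transient errors only; an open breaker should fail immediately.
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.5),
       retry=retry_if_not_exception_type(CircuitOpen), reraise=True)
async def fetch_with_resilience(url):
    # httpx.get is synchronous; the breaker awaits fn, so use the async client.
    async with httpx.AsyncClient() as client:
        return await breaker.call(client.get, url)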
Retries handle transient errors; breaker prevents retry storms when failure is sustained.
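Plain exponential backoff makes every client retry on the same schedule, which is its own small retry storm. Here is a sketch of exponential backoff with full jitter using only the standard library; the function name, the 10s cap, and the injectable `rng` are assumptions for illustration. If you are on tenacity, its `wait_random_exponential` covers the same idea.

```python
import random


def backoff_delays(attempts=3, base=0.5, cap=10.0, rng=random.random):
    """Return one sleep duration per retry: exponential backoff with full jitter."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 0.5, 1, 2, 4, ... up to cap
        delays.append(rng() * ceiling)             # uniform in [0, ceiling)
    return delays


print(backoff_delays())  # three random delays, each below its ceiling
```

Each delay is drawn uniformly below a ceiling that doubles per attempt, so clients that failed together spread out when they retry.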
Distributed breakers
When a service runs on many replicas, each replica holds its own breaker state, so a failing dependency trips them at different times.
For coordinated state, use a shared store (Redis) — but rarely needed. Per-replica breakers usually suffice.
Common failure modes
1. Breaker too aggressive
Trips on normal latency variance. Tune thresholds based on real failure data, not guesses.
2. Breaker hides upstream issues
“Errors went down!” Yes — because we’re failing fast. The dep is still broken. Alert on breaker state changes.
3. No fallback
Breaker opens; users see immediate errors. Plan a fallback.
4. Single breaker for many backends
Backend A is bad; calls to backend B are also blocked. Per-host.
5. No tests
Code never sees the open state in tests. Mock failures; verify behavior.
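For the "no tests" case, here is a sketch of what such a test can look like. `TinyBreaker` is a deliberately small stand-in with an injectable clock so the test never sleeps; it is not the class from the Implementation section, and the scenario names are hypothetical. Swap in your real breaker and keep the three checks.

```python
import asyncio


class CircuitOpen(Exception):
    pass


class TinyBreaker:
    """Stand-in breaker with an injectable clock; replace with your real one."""

    def __init__(self, fail_threshold, reset_timeout, clock):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed

    async def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpen()
            # Timeout elapsed: fall through and allow a trial call.
        try:
            result = await fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


async def run_scenario():
    now = [0.0]
    calls = []

    async def failing():
        calls.append("fail")
        raise ConnectionError("dependency down")

    async def healthy():
        calls.append("ok")
        return "ok"

    breaker = TinyBreaker(fail_threshold=3, reset_timeout=30, clock=lambda: now[0])

    # 1. Three failures trip the breaker.
    for _ in range(3):
        try:
            await breaker.call(failing)
        except ConnectionError:
            pass

    # 2. While open, the dependency must not be called at all.
    try:
        await breaker.call(healthy)
        rejected = False
    except CircuitOpen:
        rejected = True

    # 3. After the reset timeout, one successful trial closes it again.
    now[0] = 31.0
    recovered = await breaker.call(healthy)
    return calls, rejected, recovered


calls, rejected, recovered = asyncio.run(run_scenario())
assert rejected and recovered == "ok"
assert calls == ["fail", "fail", "fail", "ok"]
```

The `calls` list is the important assertion: it proves the open breaker never touched the dependency.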
Bulkhead pattern
Related to breakers: bulkheads partition resources so one slow dep can’t drain all your threads.
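import asyncio

sem = asyncio.Semaphore(50)  # max 50 concurrent calls to slow_service

async def call_slow_service():
    async with sem:
        return await slow_service.call()
Even if slow_service hangs, at most 50 coroutines are stuck inside it. Callers beyond the limit wait for a free slot, so pair the semaphore with a timeout if they must not queue behind a hung dependency. Paths that never touch slow_service keep serving.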
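A semaphore makes excess callers wait. If you would rather shed load than queue, a bulkhead can reject when full. This is a sketch under assumptions: `Bulkhead` and `BulkheadFull` are hypothetical names, and the plain counter is only safe because everything runs on a single asyncio event loop.

```python
import asyncio


class BulkheadFull(Exception):
    """Raised when the bulkhead has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    async def run(self, fn, *args, **kwargs):
        # There is no await between the check and the increment, so no other
        # coroutine on the same event loop can interleave here.
        if self.in_flight >= self.max_concurrent:
            raise BulkheadFull()
        self.in_flight += 1
        try:
            return await fn(*args, **kwargs)
        finally:
            self.in_flight -= 1  # release the slot on success, error, or cancel
```

Callers catch `BulkheadFull` the same way they catch an open breaker: serve the fallback or return a clean error.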
What I’d ship today
For each external dependency:
- Per-host breaker (failure rate ≥ 50% over 30s, min 20 calls).
- Bulkhead capping concurrent in-flight calls.
- Retry for transient errors with backoff + jitter.
- Timeout set just above the dep’s typical p99, so healthy calls pass and hung calls are cut off quickly.
- Fallback logic for when the breaker is open.
- Metrics on breaker state, failure rate.
- Alerts on breaker open events.
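For the last two items, one low-effort approach is a single hook that every state change goes through. The sketch below is an assumption about how you might wire it: `on_state_change` is a hypothetical function you would call wherever your breaker assigns a new state, and the `Counter` stands in for a real metrics client.

```python
import logging
from collections import Counter

log = logging.getLogger("breaker")
transitions = Counter()  # stand-in for a real metrics client


def on_state_change(name: str, old: str, new: str) -> None:
    """Count every transition; log at WARNING when a breaker opens."""
    transitions[(name, old, new)] += 1
    level = logging.WARNING if new == "open" else logging.INFO
    log.log(level, "breaker %s: %s -> %s", name, old, new)


on_state_change("stripe", "closed", "open")
on_state_change("stripe", "open", "half_open")
```

Alerting on the warning-level line (or on the closed-to-open counter) tells you the dependency is broken even while your error rate looks fine.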
If you want my circuit breaker + bulkhead reference (Python + Go), it’s at rajpoot.dev.