A circuit breaker stops your service from hammering a dying dependency. Done right, your failures stay isolated. Done wrong, the breaker itself becomes the bug. This post is the working playbook.

The pattern

[Closed] ── too many failures ──▶ [Open] ── timeout ──▶ [Half-Open]
   ▲                                                       │
   └──────── successful test request ──────────────────────┘
                                       └── failure ──▶ [Open]
  • Closed: normal — calls go through.
  • Open: failures exceeded threshold — fail fast, don’t even try.
  • Half-Open: timeout expired — allow one trial; if good, close; if bad, re-open.

Why it matters

Without a breaker:

  • Stripe is degraded; latency goes from 100ms to 30s.
  • Your service still calls Stripe.
  • All your threads/coroutines are stuck waiting on Stripe.
  • New requests pile up; pool exhausts.
  • Your service is now “down” too — even for non-Stripe paths.

With a breaker: after N failures, calls to Stripe fail fast, so non-Stripe paths keep working.
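To see how quickly the pool drains, Little's law helps: average in-flight calls = arrival rate × latency. Below is a back-of-the-envelope sketch using the two latencies above. The traffic rate and pool size are made-up numbers for illustration, not measurements.

```python
def in_flight(calls_per_sec: float, latency_sec: float) -> float:
    """Little's law: average concurrent calls = arrival rate x latency."""
    return calls_per_sec * latency_sec

CALLS_PER_SEC = 50   # assumed traffic to Stripe
POOL_SIZE = 200      # assumed worker/connection pool size

healthy = in_flight(CALLS_PER_SEC, 0.1)     # 100ms latency
degraded = in_flight(CALLS_PER_SEC, 30.0)   # 30s latency
seconds_to_exhaust = POOL_SIZE / CALLS_PER_SEC  # time until every worker is stuck

print(healthy, degraded, seconds_to_exhaust)
```

Under these assumptions the healthy system needs about 5 workers for Stripe, the degraded one would need 1,500, and a pool of 200 is fully occupied a few seconds after Stripe stalls.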

Implementation

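from enum import Enum
import time

class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without trying it."""

class State(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_timeout=30, half_open_calls=1):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout      # seconds to stay open
        self.half_open_calls = half_open_calls  # max concurrent trial calls
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.trials = 0  # trial calls currently in flight

    async def call(self, fn, *args, **kwargs):
        if self.state == State.OPEN:
            # monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.opened_at > self.reset_timeout:
                self.state = State.HALF_OPEN
            else:
                raise CircuitOpen()

        trial = self.state == State.HALF_OPEN
        if trial:
            # Let only a few trial calls through to the recovering dep
            if self.trials >= self.half_open_calls:
                raise CircuitOpen()
            self.trials += 1

        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        else:
            self._on_success()
            return result
        finally:
            if trial:
                self.trials -= 1

    def _on_failure(self):
        self.failures += 1
        # A failed trial re-opens at once; otherwise open at the threshold
        if self.state == State.HALF_OPEN or self.failures >= self.fail_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED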

Thresholds

Setting              Typical  Notes
Failure threshold    5–10     Too low = flaps; too high = slow to react
Reset timeout        10–60s   How long open before retry
Failure rate window  10–30s   Use a rolling window, not all-time count
Half-open trials     1–3      Don’t flood the recovering dep

For external dependencies: rolling-window failure rate (e.g., “50% in last 30s”) + minimum volume (e.g., “at least 20 calls”) is more robust than fixed counts.
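A minimal sketch of that rule, assuming outcomes are kept as timestamped entries in a deque. The class name, method names, and the injectable `clock` parameter are illustrative choices, not part of any library.

```python
import time
from collections import deque

class RollingFailureRate:
    """Tracks call outcomes over the last `window` seconds."""

    def __init__(self, window=30.0, min_calls=20, max_failure_rate=0.5,
                 clock=time.monotonic):
        self.window = window
        self.min_calls = min_calls
        self.max_failure_rate = max_failure_rate
        self.clock = clock          # injectable so tests can control time
        self.events = deque()       # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded: bool) -> None:
        self.events.append((self.clock(), succeeded))

    def _evict(self) -> None:
        # Drop outcomes that have aged out of the window
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_trip(self) -> bool:
        self._evict()
        total = len(self.events)
        if total < self.min_calls:
            return False            # too little traffic to judge
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.max_failure_rate
```

A breaker would call `record()` after every attempt and open when `should_trip()` returns True. The minimum-volume check is what keeps two failures out of three calls at 3am from tripping it.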

Fallback

async def get_recommendations(user_id):
    try:
        return await breaker.call(rec_service.get, user_id)
    except CircuitOpen:
        return cached_default_recs()

When the breaker opens: degrade gracefully. Cached results, simpler heuristics, or “we’ll show you something else.”

If there’s no fallback worth showing: at least show a clean error rather than a 30s timeout.

Per-host breakers

For services calling many backends, one breaker per host:

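breakers: dict[str, CircuitBreaker] = {}

async def call(host, fn, *args, **kwargs):
    # setdefault() would build a throwaway CircuitBreaker on every call
    b = breakers.get(host)
    if b is None:
        b = breakers[host] = CircuitBreaker()
    return await b.call(fn, *args, **kwargs)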

One backend going bad doesn’t cut off others.

Libraries

Library       Lang    Strengths
resilience4j  Java    Comprehensive (breaker, retry, bulkhead, rate limit)
Polly         .NET    Mature; rich policies
gobreaker     Go      Simple, classic CB
pybreaker     Python  Lightweight
opossum       Node    Async-friendly
Hystrix       Java    Maintenance-only; new code → resilience4j

Roll your own only if you need very specific behavior.

Combine with retries

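import httpx
from tenacity import retry, retry_if_not_exception_type, stop_after_attempt, wait_exponential

client = httpx.AsyncClient()  # httpx.get is sync; breaker.call needs an awaitable

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.5),
       retry=retry_if_not_exception_type(CircuitOpen))  # never retry a fail-fast rejection
async def fetch_with_resilience(url):
    return await breaker.call(client.get, url)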

Retries handle transient errors; breaker prevents retry storms when failure is sustained.
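Plain exponential backoff still has every client retrying on the same schedule, so retries arrive in synchronized waves. Here is a minimal sketch of exponential backoff with full jitter in plain Python. The function name and the `base`, `cap`, and `rng` parameters are illustrative assumptions.

```python
import random

def backoff_delays(attempts, base=0.5, cap=8.0, rng=random.random):
    """Full jitter: attempt n sleeps a random time in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]

# Upper bounds for 5 attempts are 0.5, 1, 2, 4, 8 seconds;
# the actual delays are random values below those bounds.
delays = backoff_delays(5)
```

If you already use tenacity, its `wait_random_exponential` wait strategy serves the same purpose without hand-rolled code.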

Distributed breakers

For services on many replicas, each replica holds its own breaker state. A failing dependency trips the breakers on different replicas at different times.

For coordinated state, use a shared store (Redis) — but rarely needed. Per-replica breakers usually suffice.

Common failure modes

1. Breaker too aggressive

Trips on normal latency variance. Tune thresholds based on real failure data, not guesses.

2. Breaker hides upstream issues

“Errors went down!” Yes — because we’re failing fast. The dep is still broken. Alert on breaker state changes.

3. No fallback

Breaker opens; users see immediate errors. Plan a fallback.

4. Single breaker for many backends

Backend A is bad; calls to backend B are also blocked. Use one breaker per host.

5. No tests

Code never sees the open state in tests. Mock failures; verify behavior.
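One way to exercise the open state without waiting 30 seconds is to inject the clock. The sketch below uses a stripped-down stand-in breaker (`TinyBreaker`, a hypothetical name) rather than the full class above, plus two fake dependencies. The point is the shape of the test, not the breaker itself.

```python
import asyncio
import time

class CircuitOpen(Exception):
    pass

class TinyBreaker:
    """Stripped-down stand-in breaker with an injectable clock."""
    def __init__(self, fail_threshold=2, reset_timeout=30, clock=time.monotonic):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed

    async def call(self, fn, *args):
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_timeout:
            raise CircuitOpen()
        try:
            result = await fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result

async def boom():
    raise ConnectionError("dependency down")

async def ok():
    return "ok"

async def scenario():
    now = [0.0]
    breaker = TinyBreaker(clock=lambda: now[0])
    seen = []
    for _ in range(3):
        try:
            await breaker.call(boom)
        except ConnectionError:
            seen.append("error")      # the call reached the dependency
        except CircuitOpen:
            seen.append("rejected")   # the breaker failed fast
    now[0] = 31.0                     # jump past reset_timeout
    seen.append(await breaker.call(ok))
    return seen

result = asyncio.run(scenario())
```

The same structure works for the fallback path: trip the breaker with a failing fake, then assert that the caller returns the cached default instead of raising.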

Bulkhead pattern

Related to breakers: bulkheads partition resources so one slow dep can’t drain all your threads.

sem = asyncio.Semaphore(50)  # max 50 concurrent calls to slow_service

async def call_slow_service():
    async with sem:
        return await slow_service.call()

Even if slow_service hangs, only 50 of your coroutines are stuck, and the rest can serve other paths.

What I’d ship today

For each external dependency:

  • Per-host breaker (failure rate ≥ 50% over 30s, min 20 calls).
  • Bulkhead capping concurrent in-flight calls.
  • Retry for transient errors with backoff + jitter.
  • Timeout set just above the dep’s typical p99 (any tighter and you fail healthy calls).
  • Fallback logic for when the breaker is open.
  • Metrics on breaker state, failure rate.
  • Alerts on breaker open events.
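For the last two items, one possible shape is to route every state change through a single setter that notifies a callback. This is a sketch only: `StateTracker` and its `on_change` hook are hypothetical names, and the default handler just logs, where a real setup would also bump a metrics counter or page someone.

```python
import logging
from enum import Enum

log = logging.getLogger("breaker")

class State(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3

class StateTracker:
    """Holds breaker state and notifies a callback on every transition."""

    def __init__(self, name, on_change=None):
        self.name = name
        self.state = State.CLOSED
        self.on_change = on_change or self._log_change

    def set(self, new_state):
        if new_state is self.state:
            return  # not a transition, nothing to report
        old, self.state = self.state, new_state
        self.on_change(self.name, old, new_state)

    @staticmethod
    def _log_change(name, old, new):
        # Opening is the event worth alerting on; the rest is informational
        level = logging.WARNING if new is State.OPEN else logging.INFO
        log.log(level, "breaker %s: %s -> %s", name, old.name, new.name)
```

A breaker would call `tracker.set(State.OPEN)` wherever it currently assigns `self.state`, which keeps alerting out of the call path logic.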

Read this next

If you want my circuit breaker + bulkhead reference (Python + Go), it’s at rajpoot.dev.

