Do I need a circuit breaker for every dependency?

Critical dependencies — yes. Optional dependencies — yes (fail fast, degrade gracefully). Internal calls within the same service — usually no, simple timeouts suffice.

Circuit breaker or just retries?

Both. Retries handle transient failures; breakers prevent retry storms when the dependency is genuinely down. Pair them: send every retry attempt through the breaker, and stop retrying once it opens.

Circuit Breakers in 2026 — Patterns, Pitfalls, and When They Save You

A circuit breaker stops your service from hammering a dying dependency. Done right, your failures stay isolated. Done wrong, the breaker itself becomes the bug. This post is the working playbook.

The pattern

[Closed] ── too many failures ──▶ [Open] ── timeout ──▶ [Half-Open]
   ▲                                ▲                       │
   │                                └── trial fails ────────┤
   └──────────── trial succeeds ────────────────────────────┘

Closed: normal — calls go through.
Open: failures exceeded threshold — fail fast, don’t even try.
Half-Open: timeout expired — allow one trial; if good, close; if bad, re-open.

Why it matters

Without a breaker:

Stripe is degraded; latency goes from 100ms to 30s.
Your service still calls Stripe.
All your threads/coroutines are stuck waiting on Stripe.
New requests pile up; pool exhausts.
Your service is now “down” too — even for non-Stripe paths.

With a breaker: after N failures, calls to Stripe fail fast, so threads are released immediately and non-Stripe paths keep working.

Implementation

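from enum import Enum
import time

class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without trying it."""

class State(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_timeout=30, half_open_calls=1):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.half_open_calls = half_open_calls  # max trial calls in flight
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.trials = 0  # trial calls currently in flight

    async def call(self, fn, *args, **kwargs):
        if self.state == State.OPEN:
            # monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpen()
            self.state = State.HALF_OPEN

        trial = self.state == State.HALF_OPEN
        if trial:
            if self.trials >= self.half_open_calls:
                raise CircuitOpen()  # enough probes in flight already
            self.trials += 1

        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        else:
            self._on_success()
            return result
        finally:
            if trial:
                self.trials -= 1

    def _on_failure(self):
        self.failures += 1
        # a failed trial re-opens at once; otherwise wait for the threshold
        if self.state == State.HALF_OPEN or self.failures >= self.fail_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED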

Thresholds

Setting	Typical	Notes
Failure threshold	5–10	Too low = flaps; too high = slow to react
Reset timeout	10–60s	How long the breaker stays open before allowing a trial call
Failure rate window	10–30s	Use a rolling window, not all-time count
Half-open trials	1–3	Don’t flood the recovering dep

For external dependencies: rolling-window failure rate (e.g., “50% in last 30s”) + minimum volume (e.g., “at least 20 calls”) is more robust than fixed counts.
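A minimal sketch of that rolling-window rule, assuming outcomes are recorded one call at a time. The class name `RollingFailureRate`, its parameters, and the injectable `clock` are hypothetical, chosen for illustration; a real breaker would consult `should_trip()` after each recorded failure.

```python
import time
from collections import deque


class RollingFailureRate:
    """Tracks call outcomes over a sliding time window (illustrative only)."""

    def __init__(self, window_s=30.0, min_calls=20, max_failure_rate=0.5,
                 clock=time.monotonic):
        self.window_s = window_s
        self.min_calls = min_calls
        self.max_failure_rate = max_failure_rate
        self.clock = clock  # injectable so tests can control time
        self.events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self.events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_trip(self):
        self._evict()
        total = len(self.events)
        if total < self.min_calls:
            return False  # too little traffic to judge
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.max_failure_rate
```

The minimum-volume check is what keeps a quiet service from tripping on two failures out of three calls.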

Fallback

async def get_recommendations(user_id):
    try:
        return await breaker.call(rec_service.get, user_id)
    except CircuitOpen:
        return cached_default_recs()

When the breaker opens: degrade gracefully. Cached results, simpler heuristics, or “we’ll show you something else.”

If there’s no fallback worth showing: at least show a clean error rather than a 30s timeout.

Per-host breakers

For services calling many backends, one breaker per host:

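breakers: dict[str, CircuitBreaker] = {}

async def call(host, fn, *args, **kwargs):
    # look up first so a new CircuitBreaker is built only for unseen hosts
    b = breakers.get(host)
    if b is None:
        b = breakers[host] = CircuitBreaker()
    return await b.call(fn, *args, **kwargs)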

One backend going bad doesn’t cut off others.

Libraries

Library	Language	Strengths
resilience4j	Java	Comprehensive (breaker, retry, bulkhead, rate limit)
Polly	.NET	Mature; rich policies
gobreaker	Go	Simple, classic CB
pybreaker	Python	Lightweight
opossum	Node	Async-friendly
Hystrix	Java	Maintenance-only; new code → resilience4j

Roll your own only if you need very specific behavior.

Combine with retries

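import httpx
from tenacity import retry, retry_if_not_exception_type, stop_after_attempt, wait_exponential

async def fetch_with_resilience(url):
    @retry(retry=retry_if_not_exception_type(CircuitOpen),  # an open breaker is not transient
           stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.5))
    async def go():
        async with httpx.AsyncClient() as client:  # httpx.get is synchronous and cannot be awaited
            return await breaker.call(client.get, url)
    return await go()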

Retries handle transient errors; breaker prevents retry storms when failure is sustained.
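For readers without a retry library, the same pairing can be sketched by hand. This is an illustration under assumptions, not a drop-in: `retry_with_jitter` and its parameters are hypothetical names, `CircuitOpen` is a local stand-in for the breaker's exception, and the delay uses capped exponential backoff with full jitter.

```python
import asyncio
import random


class CircuitOpen(Exception):
    """Stand-in for the exception a breaker raises when it is open."""


async def retry_with_jitter(fn, attempts=3, base=0.5, cap=10.0,
                            sleep=asyncio.sleep):
    """Call fn() up to `attempts` times, sleeping a random delay between tries."""
    for attempt in range(attempts):
        try:
            return await fn()
        except CircuitOpen:
            raise  # breaker is open: retrying only adds latency
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            ceiling = min(cap, base * 2 ** attempt)
            await sleep(random.uniform(0, ceiling))
```

Here `fn` would be a zero-argument coroutine function that routes through the breaker, so each attempt is counted and an open breaker ends the loop at once.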

Distributed breakers

For services on many replicas, each replica holds its own breaker state. A failing dependency trips the breakers on different replicas at different times.

For coordinated state, use a shared store (Redis) — but rarely needed. Per-replica breakers usually suffice.

Common failure modes

1. Breaker too aggressive

Trips on normal latency variance. Tune thresholds based on real failure data, not guesses.

2. Breaker hides upstream issues

“Errors went down!” Yes — because we’re failing fast. The dep is still broken. Alert on breaker state changes.
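One way to make state changes visible, sketched under assumptions: a small reporter that the breaker calls whenever it changes state. `StateReporter`, `set_state`, and the logger name are hypothetical; in practice the counter would feed whatever metrics client you already run.

```python
import logging
from enum import Enum

log = logging.getLogger("breaker")


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class StateReporter:
    """Counts and logs breaker transitions so alerts can key off them."""

    def __init__(self, name):
        self.name = name
        self.state = State.CLOSED
        self.transitions = {}  # (from, to) -> count

    def set_state(self, new_state):
        if new_state is self.state:
            return  # not a transition, nothing to report
        key = (self.state.value, new_state.value)
        self.transitions[key] = self.transitions.get(key, 0) + 1
        # Opening is the event worth paging on; the rest is informational.
        level = logging.WARNING if new_state is State.OPEN else logging.INFO
        log.log(level, "breaker %s: %s -> %s", self.name, key[0], key[1])
        self.state = new_state
```

An alert on the closed-to-open count catches the outage that a falling error rate would otherwise hide.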

3. No fallback

Breaker opens; users see immediate errors. Plan a fallback.

4. Single breaker for many backends

Backend A is bad; calls to backend B are also blocked. Per-host.

5. No tests

Code never sees the open state in tests. Mock failures; verify behavior.
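A sketch of such a test, assuming a breaker that accepts an injectable clock. `TinyBreaker` is a hypothetical stand-in so the example runs on its own; swap in the real class. The fake clock lets the test cross the reset timeout without sleeping.

```python
import asyncio
import time


class CircuitOpen(Exception):
    pass


class TinyBreaker:
    """Minimal stand-in for the breaker under test."""

    def __init__(self, fail_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed

    async def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpen()
            # Timeout elapsed: fall through and let a trial call run.
        try:
            result = await fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.fail_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


async def scenario():
    now = [0.0]
    breaker = TinyBreaker(fail_threshold=3, reset_timeout=30.0, clock=lambda: now[0])
    attempts = 0

    async def always_fails():
        nonlocal attempts
        attempts += 1
        raise ConnectionError("dependency down")

    async def healthy():
        return "ok"

    for _ in range(3):  # drive the breaker to its threshold
        try:
            await breaker.call(always_fails)
        except ConnectionError:
            pass

    try:  # open: the call must be rejected without reaching the dependency
        await breaker.call(always_fails)
        rejected = False
    except CircuitOpen:
        rejected = True

    now[0] = 31.0  # advance the fake clock past reset_timeout
    recovered = await breaker.call(healthy)
    return attempts, rejected, recovered
```

The `attempts` counter is the key assertion: it proves the open breaker never touched the dependency on the rejected call.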

Bulkhead pattern

Related to breakers: bulkheads partition resources so one slow dep can’t drain all your threads.

sem = asyncio.Semaphore(50)  # max 50 concurrent calls to slow_service

async def call_slow_service():
    async with sem:
        return await slow_service.call()

Even if slow_service hangs, at most 50 coroutines are stuck inside it; the rest can serve other paths. Callers beyond the limit wait for a free slot, so pair the semaphore with a timeout.
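A semaphore queues callers once the limit is reached. If you would rather shed load than queue, a rejecting bulkhead is a small variation. This is a sketch, and `Bulkhead` and `BulkheadFull` are hypothetical names. It relies on asyncio running on a single thread: there is no `await` between the check and the increment, so no lock is needed.

```python
import asyncio


class BulkheadFull(Exception):
    """Raised when the concurrency limit is already reached."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    async def run(self, fn, *args, **kwargs):
        if self.in_flight >= self.max_concurrent:
            raise BulkheadFull()  # reject immediately instead of queueing
        self.in_flight += 1
        try:
            return await fn(*args, **kwargs)
        finally:
            self.in_flight -= 1  # release the slot on success, error, or cancel
```

Callers can treat `BulkheadFull` like an open breaker and go straight to the fallback.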

What I’d ship today

For each external dependency:

Per-host breaker (failure rate ≥ 50% over 30s, min 20 calls).
Bulkhead capping concurrent in-flight calls.
Retry for transient errors with backoff + jitter.
Timeout set just above the dep’s typical p99, so healthy calls pass and hung calls are cut off quickly.
Fallback logic for when the breaker is open.
Metrics on breaker state, failure rate.
Alerts on breaker open events.
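The per-call timeout from the list above can be as small as a wrapper around `asyncio.wait_for`. A minimal sketch, with `call_with_timeout` and its parameters as hypothetical names; in a full stack the timeout would sit inside the breaker so that timeouts count as failures.

```python
import asyncio


async def call_with_timeout(fn, timeout_s, fallback):
    """Bound the wait on a dependency and degrade instead of hanging."""
    try:
        # wait_for cancels the inner call once timeout_s elapses
        return await asyncio.wait_for(fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback()
```

For example, `await call_with_timeout(fetch_recs, 0.8, cached_default_recs)` would return the cached result if the live call takes longer than 800 ms, where `fetch_recs` is an assumed zero-argument coroutine function.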

Read this next

If you want my circuit breaker + bulkhead reference (Python + Go), it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
