A service that doesn’t handle dependency failure gracefully isn’t a production service — it’s a single-point-of-failure. The five resilience patterns below are non-negotiable for systems that survive Tuesday afternoons.
1. Timeouts (the foundation)
import httpx

async with httpx.AsyncClient(timeout=httpx.Timeout(5.0, connect=2.0)) as client:  # 5 s default, 2 s to connect
    resp = await client.get(url)
No timeout = unbounded wait = worker pool exhaustion = outage. Every external call needs a timeout.
Pick by SLA: for a service with a 1 s p99 budget, downstream calls should time out in under 500 ms.
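To make the "unbounded wait" point concrete, here is a minimal sketch using only the standard library. The names slow_dependency and call_with_budget are hypothetical, and the 50 ms budget is chosen only so the example finishes quickly.

```python
import asyncio


async def slow_dependency():
    """Hypothetical downstream call that takes far longer than we can afford."""
    await asyncio.sleep(10)
    return "late response"


async def call_with_budget(budget_s):
    # wait_for cancels the inner call once the budget is spent,
    # so the worker is released instead of waiting the full 10 s.
    try:
        return await asyncio.wait_for(slow_dependency(), timeout=budget_s)
    except asyncio.TimeoutError:
        return None  # caller decides: fallback value, cached data, or an error


result = asyncio.run(call_with_budget(0.05))
```

Returning None here is just one possible policy; raising a domain-specific error works equally well, as long as the wait is bounded.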
2. Retries (with jitter)
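import asyncio
import random

class RetryableError(Exception):
    """Transient failure that is safe to retry."""

async def with_retry(fn, max_attempts=3, base=0.1):
    for attempt in range(max_attempts):
        try:
            return await fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(random.uniform(0, base * 2**attempt))  # full jitter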
Exponential backoff plus jitter. Never retry without jitter: without it, your fleet retries in lockstep and DDoSes the recovering downstream.
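The lockstep effect is easy to see by comparing the delays that a group of clients would compute. This is a small sketch, not a benchmark: the function names, the cap parameter, and the ten seeded clients are assumptions made for illustration.

```python
import random


def backoff_no_jitter(attempt, base=0.1, cap=5.0):
    """Plain exponential backoff: every client computes the same delay."""
    return min(cap, base * 2**attempt)


def backoff_full_jitter(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: uniform in [0, exponential ceiling], so clients spread out."""
    return rng.uniform(0, min(cap, base * 2**attempt))


# Ten hypothetical clients that all failed at the same instant, on their third attempt.
clients = [random.Random(seed) for seed in range(10)]
lockstep = {backoff_no_jitter(2) for _ in clients}
spread = {backoff_full_jitter(2, rng=rng) for rng in clients}
```

Without jitter, the set of delays collapses to a single value, so every client hits the downstream at the same moment. With full jitter, the same ten clients land at different points inside the same window.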
For the deeper retry math see Idempotency, Retries, and Exactly-Once Illusions.
3. Circuit breakers
# Illustrative breaker API: trip after 5 failures, probe again after 30 s
breaker = CircuitBreaker(fail_threshold=5, reset_timeout=30)

async def call_external():
    if breaker.is_open():
        raise ServiceUnavailable("circuit open")  # fail fast, no network call
    try:
        result = await external_api()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise
States: closed (normal), open (failing fast), half-open (probing). The breaker opens after N failures. Once reset_timeout has elapsed it moves to half-open and lets a few requests through to test recovery: a success closes it, a failure reopens it.
Why: a flaky downstream that takes 30s to time out × 100 concurrent requests = 100 workers tied up. A circuit breaker fails them in microseconds, freeing workers for healthy paths.
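For readers who want to see the state machine itself, here is a minimal sketch of a breaker exposing the same is_open / record_success / record_failure methods used above. It is an illustration under simplifying assumptions, not the API of any of the libraries listed below: it counts consecutive failures, it is not thread-safe, and its half-open state admits every caller rather than a limited number of probes. The clock parameter is an addition so that time can be faked in tests.

```python
import time


class CircuitBreaker:
    """Minimal three-state breaker (closed, open, half-open); not thread-safe."""

    def __init__(self, fail_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock        # injectable so tests can fake time
        self._failures = 0         # consecutive failures while closed
        self._opened_at = None     # None means the breaker is not open

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def is_open(self):
        # Only the fully open state rejects calls; half-open lets probes through.
        return self.state == "open"

    def record_success(self):
        # Any success, including a half-open probe, closes the breaker.
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        state = self.state
        if state == "open":
            return  # late failure from a call that started before the trip
        if state == "half-open":
            self._opened_at = self._clock()  # probe failed: reopen, restart the timer
            return
        self._failures += 1
        if self._failures >= self.fail_threshold:
            self._opened_at = self._clock()
```

Deriving the half-open state from the clock, instead of storing it, keeps the class free of timers and background tasks. A production breaker would usually also cap the number of concurrent probes.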
Libraries: pybreaker (Python), resilience4j (Java), gobreaker (Go), circuit-breaker-rs (Rust).
4. Bulkheads
Separate connection pools / thread pools per downstream:
db_pool = create_pool(size=10) # DB calls
billing_pool = create_pool(size=5) # billing API
search_pool = create_pool(size=8) # search service
If billing API hangs, only the 5 billing slots fill. DB and search keep working. Without bulkheads, one slow downstream eats your whole worker pool.
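The same isolation can be sketched without a pool library by giving each dependency its own semaphore. The Bulkhead and BulkheadFull names, the slot counts, and the simulated billing and search calls are all hypothetical. This variant rejects callers when no slot is free, which is one possible policy; waiting for a slot with a timeout is another.

```python
import asyncio


class BulkheadFull(Exception):
    """Raised when a dependency has no free slot."""


class Bulkhead:
    """Caps concurrent calls to one dependency; excess callers are rejected."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = asyncio.Semaphore(max_concurrent)

    async def run(self, make_call):
        if self._slots.locked():      # no free slot: reject instead of queueing
            raise BulkheadFull(self.name)
        async with self._slots:       # the slot is released even if the call fails
            return await make_call()


async def demo():
    billing = Bulkhead("billing", max_concurrent=2)
    search = Bulkhead("search", max_concurrent=2)

    async def hung_billing():
        await asyncio.sleep(0.2)      # simulated billing API that is hanging
        return "billing ok"

    async def fast_search():
        return "search ok"

    # Five billing calls arrive at once; only two of them get a slot.
    billing_tasks = [asyncio.create_task(billing.run(hung_billing)) for _ in range(5)]
    await asyncio.sleep(0)            # let the billing calls start
    search_result = await search.run(fast_search)   # unaffected by billing
    billing_results = await asyncio.gather(*billing_tasks, return_exceptions=True)
    return search_result, billing_results


search_result, billing_results = asyncio.run(demo())
```

While the two admitted billing calls are still hanging, the search call completes immediately because it draws from a separate set of slots.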
For Kubernetes-level bulkheading, isolate workloads via separate namespaces / node pools.
5. Backpressure
Producers can outrun consumers. Without backpressure, queues grow without bound, memory runs out, and the service crashes.
Mechanisms:
- Bounded queues (block when full).
- Rate limiting (Design a Rate Limiter).
- Load shedding at the edge.
- Adaptive concurrency (TCP-style AIMD — increase workers until errors, then back off).
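The adaptive-concurrency bullet is the least self-explanatory of the four, so here is a rough sketch of the AIMD rule on its own. The class name, the defaults, and the choice to add one slot per success are assumptions; real limiters typically increase more slowly and use latency as well as errors as a signal.

```python
class AIMDLimit:
    """Additive-increase / multiplicative-decrease concurrency limit."""

    def __init__(self, initial=10, minimum=1, maximum=100, backoff=0.5):
        self.limit = float(initial)
        self.minimum = minimum
        self.maximum = maximum
        self.backoff = backoff

    def on_success(self):
        # Additive increase: probe for more capacity one slot at a time.
        self.limit = min(self.maximum, self.limit + 1)

    def on_overload(self):
        # Multiplicative decrease: back off sharply on errors or timeouts.
        self.limit = max(self.minimum, self.limit * self.backoff)

    @property
    def max_in_flight(self):
        """Whole number of requests currently allowed in flight."""
        return int(self.limit)
```

The caller would compare its in-flight count against max_in_flight before starting new work, and report each outcome back through on_success or on_overload.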
Bounded everywhere by default. Unbounded queues are the most common production footgun.
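As a small illustration of "bounded by default", the sketch below puts a burst of ten items into an asyncio.Queue with room for three and sheds the rest. The queue size and burst size are arbitrary. A producer that should wait instead of shedding would use await queue.put(item), which blocks until a slot frees up.

```python
import asyncio


async def demo():
    queue = asyncio.Queue(maxsize=3)   # bounded: at most 3 items waiting
    shed = 0

    # A burst of 10 items arrives before the consumer has drained anything.
    for item in range(10):
        try:
            queue.put_nowait(item)     # load shedding: reject instead of buffering
        except asyncio.QueueFull:
            shed += 1

    accepted = []
    while not queue.empty():
        accepted.append(queue.get_nowait())
    return accepted, shed


accepted, shed = asyncio.run(demo())
```

Memory use is capped at three queued items no matter how large the burst is; the cost is that seven requests get an immediate rejection, which the caller can handle.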
Composing them
Real production calls compose the patterns. This one stacks a bulkhead, a circuit breaker, a timeout, and retries:
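async def call_billing(payload):
    if billing_breaker.is_open():                  # circuit breaker: fail fast before taking a slot
        raise ServiceUnavailable("billing circuit open")
    async with billing_pool.acquire() as conn:     # bulkhead
        try:
            async with asyncio.timeout(2.0):       # timeout: total budget for all attempts (Python 3.11+)
                result = await retry_with_jitter(  # retries
                    lambda: conn.post(url, json=payload),
                    max=3,
                )
        except Exception:                          # includes TimeoutError from the budget above
            billing_breaker.record_failure()
            raise
        billing_breaker.record_success()
        return result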
Verbose, but each layer protects against a different failure mode. Library wrappers (tenacity plus a circuit breaker decorator and a connection-pool manager) make this concise.
Deadline propagation
If the user’s request has a 1-second deadline, propagate that deadline to downstream calls:
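import asyncio
import time

# app, Request and fetch_order come from the surrounding (FastAPI-style) application.
@app.get("/order/{order_id}")
async def get_order(order_id: int, request: Request):
    deadline = request.state.deadline          # absolute time.monotonic() value, set by middleware
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("deadline already passed")
    async with asyncio.timeout(remaining * 0.8):   # Python 3.11+; leaves 20% headroom
        return await fetch_order(order_id)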
Every downstream knows how much time it has left. Slow paths abandon early instead of completing work the user no longer waits for.
gRPC has this built in (deadlines propagate via headers). For HTTP, propagate via a header (X-Request-Deadline) and middleware.
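The header-based approach can be sketched framework-free as two helpers that middleware and HTTP clients would call. The header name comes from the text above; everything else is an assumption, in particular the choice to carry the remaining budget in milliseconds. A relative value is used because monotonic clocks are not comparable between hosts (gRPC's grpc-timeout header is likewise a relative timeout). The helper names and the 1000 ms default are hypothetical.

```python
import time

HEADER = "X-Request-Deadline"  # name from the text; the value format is an assumption


def incoming_deadline(headers, default_ms=1000, now=time.monotonic):
    """Turn the caller's remaining budget (ms) into a local monotonic deadline."""
    try:
        remaining_ms = float(headers.get(HEADER, default_ms))
    except ValueError:
        remaining_ms = default_ms  # malformed header: fall back to the default
    return now() + remaining_ms / 1000


def outgoing_headers(deadline, now=time.monotonic):
    """Headers for a downstream call; raises if the budget is already spent."""
    remaining_ms = (deadline - now()) * 1000
    if remaining_ms <= 0:
        raise TimeoutError("deadline exceeded before calling downstream")
    return {HEADER: str(int(remaining_ms))}
```

Inbound middleware would store incoming_deadline(request.headers) on the request state, and every outbound call would merge outgoing_headers(deadline) into its headers. A plain dict lookup is case-sensitive, so a real implementation should rely on the framework's case-insensitive header mapping.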
Common mistakes
1. Retries without jitter
Stampede.
2. Retries on non-retryable errors
400s never become 200s. Don’t retry them.
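One way to encode this rule is an explicit allow-list of status codes. The set below is an assumed policy rather than a standard: it treats request timeouts, throttling, and gateway or availability errors as transient, and leaves everything else alone. Whether a plain 500 belongs in the set depends on the downstream.

```python
# Assumed policy, not a standard: statuses that usually signal a transient condition.
RETRYABLE_STATUSES = frozenset({408, 429, 502, 503, 504})


def should_retry(status_code):
    """True only for responses that may succeed if the same request is sent again."""
    return status_code in RETRYABLE_STATUSES
```

An allow-list fails safe: an unfamiliar status code is not retried until someone decides it should be.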
3. No timeout on the outermost call
A 5s downstream timeout × 5 retries = 25s caller wait. Cap the total budget.
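A sketch of what capping the total budget could look like: each attempt gets the smaller of its own timeout and whatever remains of the overall deadline. The function name, parameters, and the simulated hung call are hypothetical, and for brevity this version retries only on timeouts.

```python
import asyncio
import random
import time


async def retry_within_budget(fn, total_budget_s, per_attempt_s, max_attempts=5, base=0.1):
    """Retry fn on timeout, but keep attempts plus backoff within total_budget_s."""
    deadline = time.monotonic() + total_budget_s
    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # Each attempt gets the smaller of its own timeout and what is left overall.
            return await asyncio.wait_for(fn(), timeout=min(per_attempt_s, remaining))
        except asyncio.TimeoutError:
            delay = random.uniform(0, base * 2**attempt)
            # Never sleep past the overall deadline.
            await asyncio.sleep(min(delay, max(0.0, deadline - time.monotonic())))
    raise TimeoutError(f"gave up after {total_budget_s}s total budget")


async def hung():
    await asyncio.sleep(5)  # simulated downstream that never answers in time


async def demo():
    start = time.monotonic()
    try:
        await retry_within_budget(hung, total_budget_s=0.3, per_attempt_s=0.2, base=0.01)
    except TimeoutError:
        pass
    return time.monotonic() - start


elapsed = asyncio.run(demo())
```

With a 0.2 s per-attempt timeout and five attempts, the uncapped worst case would be about one second. The 0.3 s budget ends the loop after roughly two attempts.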
4. Breaker too sensitive
With fail_threshold=5, a transient blip is enough to trip the breaker open. Tune fail_threshold to your domain.
5. No bulkheads
One bad dependency drains the whole worker pool. The classic cascading-failure trigger.
What I’d ship today
For a new service:
- Timeout on every external call.
- Retries with jitter, max 3, only on retryable errors.
- Circuit breaker on every external dependency.
- Separate connection pools per downstream.
- Bounded queues / rate limits at ingress.
- Deadlines propagated through middleware.
Boring. Effective. Saves you when the inevitable downstream incident hits.
Read this next
- Distributed Systems Fundamentals
- Idempotency, Retries, and Exactly-Once Illusions
- SLOs and Error Budgets for App Developers
- Design a Rate Limiter
If you want a Python resilience library wrapping all five patterns, it’s at rajpoot.dev.