A service that doesn’t handle dependency failure gracefully isn’t a production service; it’s a single point of failure. The five resilience patterns below are non-negotiable for systems that survive Tuesday afternoons.

1. Timeouts (the foundation)

import httpx

# 5s default for read/write/pool, 2s to establish the connection
async with httpx.AsyncClient(timeout=httpx.Timeout(5.0, connect=2.0)) as client:
    resp = await client.get(url)

No timeout = unbounded wait = worker pool exhaustion = outage. Every external call needs a timeout.

Pick by SLA: for a service with a 1s p99 budget, each downstream call should time out in under 500ms.
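One way to turn the budget rule into code is to derive each call's timeout from the request budget instead of hard-coding it. A minimal sketch, assuming a hypothetical helper named call_with_budget and an illustrative 50% share; neither comes from a library:

```python
import asyncio


async def call_with_budget(coro_fn, budget_s: float, share: float = 0.5):
    """Run coro_fn() with a timeout equal to a share of the request budget.

    budget_s and share are illustrative: with a 1.0s budget and share=0.5,
    the downstream call is cut off after 0.5s.
    """
    timeout_s = budget_s * share
    # wait_for cancels the inner coroutine and raises asyncio.TimeoutError
    # when the timeout expires.
    return await asyncio.wait_for(coro_fn(), timeout=timeout_s)
```

The share is a judgment call: a handler that makes two sequential downstream calls would need a smaller share per call.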

2. Retries (with jitter)

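import asyncio
import random

class RetryableError(Exception):
    """Raised by fn for failures that are safe to retry."""

async def with_retry(fn, max=3, base=0.1):
    for i in range(max):
        try:
            return await fn()
        except RetryableError:
            if i == max - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random time up to the exponential cap.
            await asyncio.sleep(random.uniform(0, base * 2**i))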

Exponential + jitter. Never retry without jitter — your fleet retries in lockstep and DDoSes the recovering downstream.
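To see what the schedule looks like, it can help to separate the exponential caps from the random draw. A small sketch with hypothetical helper names (backoff_caps, jittered_delays), using the same base * 2**i formula as above:

```python
import random


def backoff_caps(attempts: int, base: float = 0.1) -> list[float]:
    """Upper bound of the sleep before each retry: base * 2**i."""
    # One fewer sleep than attempts: there is no sleep after the last try.
    return [base * 2**i for i in range(attempts - 1)]


def jittered_delays(attempts: int, base: float = 0.1, rng=random) -> list[float]:
    """Full jitter: each sleep is drawn uniformly between 0 and its cap."""
    return [rng.uniform(0, cap) for cap in backoff_caps(attempts, base)]
```

With attempts=3 and base=0.1 the caps are 0.1s and 0.2s, so the worst-case added wait is 0.3s. Because each client draws its own random delay, retries from a fleet spread out across the window instead of landing together.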

For the deeper retry math, see Idempotency, Retries, and Exactly-Once Illusions.

3. Circuit breakers

breaker = CircuitBreaker(fail_threshold=5, reset_timeout=30)

async def call_external():
    if breaker.is_open():
        raise ServiceUnavailable("circuit open")
    try:
        result = await external_api()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise

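States: closed (normal), open (failing fast), half-open (probe). The breaker opens after N failures, moves to half-open once the reset timeout has elapsed, and then lets a few requests through to test recovery. A successful probe closes it again; a failed probe reopens it.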
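The state machine is small enough to sketch by hand. The class below is illustrative only: SimpleBreaker is a made-up name, not the API of pybreaker or any other library, and it simplifies half-open by letting every request through until the first result comes back.

```python
import time


class SimpleBreaker:
    """Illustrative three-state breaker; not the API of any library."""

    def __init__(self, fail_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def is_open(self) -> bool:
        # Half-open lets probes through, so only "open" rejects calls.
        return self.state == "open"

    def record_success(self) -> None:
        # A success in closed or half-open resets the breaker.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        state = self.state
        if state == "open":
            return  # late failure from an in-flight call; already open
        self.failures += 1
        if state == "half-open" or self.failures >= self.fail_threshold:
            # Trip (or re-trip after a failed probe) and restart the timer.
            self.opened_at = self.clock()
```

A production breaker would usually also limit how many probes run concurrently in half-open and count failures over a time window rather than forever.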

Why: a flaky downstream that takes 30s to time out × 100 concurrent requests = 100 workers tied up. A circuit breaker fails them in microseconds, freeing workers for healthy paths.

Libraries: pybreaker (Python), resilience4j (Java), gobreaker (Go), circuit-breaker-rs (Rust).

4. Bulkheads

Separate connection pools / thread pools per downstream:

db_pool = create_pool(size=10)            # DB calls
billing_pool = create_pool(size=5)        # billing API
search_pool = create_pool(size=8)         # search service

If the billing API hangs, only the 5 billing slots fill. DB and search keep working. Without bulkheads, one slow downstream eats your whole worker pool.
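When there is no pool object to lean on, a semaphore per downstream gives the same isolation. A minimal asyncio sketch; the Bulkhead class and its in_flight counter are hypothetical names for illustration:

```python
import asyncio


class Bulkhead:
    """Caps concurrent calls to one downstream (hypothetical helper)."""

    def __init__(self, size: int):
        self._sem = asyncio.Semaphore(size)
        self.in_flight = 0  # exposed for metrics and tests

    async def run(self, coro_fn):
        async with self._sem:  # waits here when all slots are busy
            self.in_flight += 1
            try:
                return await coro_fn()
            finally:
                self.in_flight -= 1
```

Usage would look like `await billing.run(lambda: client.post(url, json=payload))`, with one Bulkhead instance per downstream created inside the running event loop. Pairing it with a timeout matters: callers that wait forever for a slot just move the pile-up one step earlier.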

For Kubernetes-level bulkheading, isolate workloads via separate namespaces / node pools.

5. Backpressure

Producers can outrun consumers. Without backpressure, queues grow unbounded → memory blows → service crashes.

Mechanisms:

  • Bounded queues (block when full).
  • Rate limiting (see Design a Rate Limiter).
  • Load shedding at the edge.
  • Adaptive concurrency (TCP-style AIMD — increase workers until errors, then back off).
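The AIMD idea in the last bullet fits in a few lines. This is a simplified sketch with made-up names and defaults (AIMDLimit, backoff=0.5); real adaptive limiters usually also watch latency, not only errors:

```python
class AIMDLimit:
    """Additive-increase, multiplicative-decrease concurrency limit."""

    def __init__(self, initial=10, minimum=1, maximum=100, backoff=0.5):
        self.limit = float(initial)
        self.minimum = minimum
        self.maximum = maximum
        self.backoff = backoff

    def on_success(self) -> None:
        # Additive increase: probe for one more slot of capacity.
        self.limit = min(self.maximum, self.limit + 1)

    def on_error(self) -> None:
        # Multiplicative decrease: back off sharply on errors or timeouts.
        self.limit = max(self.minimum, self.limit * self.backoff)

    @property
    def allowed(self) -> int:
        """Current number of concurrent requests to admit."""
        return int(self.limit)
```

The caller would admit a request only while its in-flight count is below `allowed`, and report each outcome back through on_success or on_error.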

Bounded everywhere by default. Unbounded queues are the most common production footgun.
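With asyncio, a bounded queue gives both behaviors from the list above, depending on how you enqueue. A small sketch; the two helper names are hypothetical:

```python
import asyncio


def try_enqueue(queue: asyncio.Queue, item) -> bool:
    """Load shedding: reject immediately when the queue is full."""
    try:
        queue.put_nowait(item)
        return True
    except asyncio.QueueFull:
        return False  # caller can answer 429/503 instead of queueing


async def enqueue_blocking(queue: asyncio.Queue, item) -> None:
    """Backpressure: the producer waits until a consumer frees a slot."""
    await queue.put(item)
```

The queue has to be created with a limit, for example `asyncio.Queue(maxsize=100)`; the default maxsize of 0 means unbounded. Shedding suits edge traffic where a fast rejection beats a slow success, while blocking suits internal pipelines where the producer can afford to slow down.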

Composing them

Real production calls compose them. The sketch below layers four of the five; backpressure is enforced separately at ingress:

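async def call_billing(payload):
    if billing_breaker.is_open():                              # circuit breaker: fail fast,
        raise ServiceUnavailable("billing circuit open")       # before taking a pool slot
    async with billing_pool.acquire() as conn:                 # bulkhead
        try:
            async with asyncio.timeout(2.0):                   # timeout: total budget, Python 3.11+
                result = await retry_with_jitter(              # retries
                    lambda: conn.post(url, json=payload),
                    max=3,
                )
        except Exception:                                      # timeouts and exhausted retries
            billing_breaker.record_failure()
            raise
        billing_breaker.record_success()                       # lets a half-open breaker close
        return result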

Verbose, but each layer protects against a different failure mode. Library wrappers (tenacity + a circuit breaker decorator + connection-pool manager) make this concise.

Deadline propagation

If the user’s request has a 1-second deadline, propagate that deadline to downstream calls:

@app.get("/order/{id}")
async def get_order(id: int, request: Request):
    deadline = request.state.deadline                          # set by middleware
    remaining = deadline - time.monotonic()
    async with asyncio.timeout(remaining * 0.8):
        return await fetch_order(id)

Every downstream knows how much time it has left. Slow paths abandon early instead of completing work the user no longer waits for.

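gRPC has deadlines built in: the client sends the remaining time in the grpc-timeout header, and many implementations carry it over to outgoing calls. For HTTP, propagate it yourself with a custom header (for example X-Request-Deadline) and middleware.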
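For the HTTP case, the middleware needs two small pieces: read the incoming deadline and stamp it on outgoing requests. The sketch below is framework-free and makes one loud assumption: the header carries an absolute Unix timestamp in seconds, which only works if hosts have reasonably synchronized clocks. The function names and the 1-second default are illustrative.

```python
import time

DEADLINE_HEADER = "X-Request-Deadline"


def remaining_seconds(headers: dict, default: float = 1.0, now=time.time) -> float:
    """Time left before the caller's deadline, or `default` if none was sent.

    Assumes the header value is an absolute Unix timestamp in seconds.
    """
    raw = headers.get(DEADLINE_HEADER)
    if raw is None:
        return default
    try:
        return max(0.0, float(raw) - now())
    except ValueError:
        return default  # malformed header: fall back rather than fail


def outgoing_headers(deadline_ts: float) -> dict:
    """Headers to attach so the next hop sees the same deadline."""
    return {DEADLINE_HEADER: f"{deadline_ts:.3f}"}
```

Inside the process, the middleware can convert the remaining time into a monotonic deadline, e.g. `time.monotonic() + remaining_seconds(headers)`, so local timeouts are immune to wall-clock jumps. A relative timeout header is a reasonable alternative when clock sync cannot be trusted.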

Common mistakes

1. Retries without jitter

Stampede.

2. Retries on non-retryable errors

400s never become 200s. Don’t retry them.
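A simple way to enforce this is to classify by status code before deciding to retry. The set below is a common starting point rather than a standard; treat the exact membership as an assumption to adjust per API:

```python
# Statuses that a later attempt might plausibly fix: request timeout,
# rate limiting, and transient server-side or gateway failures.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status: int) -> bool:
    """True when retrying the same request could succeed."""
    return status in RETRYABLE_STATUS
```

Even for a retryable status, the request itself has to be safe to repeat: a non-idempotent POST needs an idempotency key or it should not be retried at all.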

3. No timeout on the outermost call

A 5s downstream timeout × 5 attempts = 25s of caller wait, before counting backoff sleeps. Cap the total budget.
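One way to cap the total is to give the retry loop its own deadline and shrink each attempt's timeout to whatever is left. A sketch under stated assumptions: the function name, parameter names, and the choice to treat timeouts and ConnectionError as retryable are all illustrative.

```python
import asyncio
import random
import time


async def retry_within_budget(fn, budget_s=2.0, per_try_s=0.5, max_tries=3, base=0.1):
    """Retry fn() but never spend more than budget_s in total."""
    deadline = time.monotonic() + budget_s
    for attempt in range(max_tries):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted")
        try:
            # Each attempt gets the smaller of its own cap and what is left.
            return await asyncio.wait_for(fn(), timeout=min(per_try_s, remaining))
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_tries - 1:
                raise
            delay = random.uniform(0, base * 2**attempt)
            # Never sleep past the overall deadline.
            await asyncio.sleep(min(delay, max(0.0, deadline - time.monotonic())))
```

With these defaults the caller waits at most about 2 seconds no matter how the individual attempts fail, instead of the product of timeout and attempt count.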

4. Breaker too sensitive

A threshold of 5 failures trips the breaker on a transient blip. Tune fail_threshold to your domain.

5. No bulkheads

One bad dependency drains the whole worker pool. The classic cascading-failure trigger.

What I’d ship today

For a new service:

  1. Timeout on every external call.
  2. Retries with jitter, max 3, only on retryable errors.
  3. Circuit breaker on every external dependency.
  4. Separate connection pools per downstream.
  5. Bounded queues / rate limits at ingress.
  6. Deadlines propagated through middleware.

Boring. Effective. Saves you when the inevitable downstream incident hits.

Read this next

If you want a Python resilience library wrapping all five patterns, it’s at rajpoot.dev.

