When does single-Redis stop scaling?

Above ~1M req/sec sustained, a single Redis becomes a bottleneck. At billion-request CDN scale, you need a distributed design — typically hybrid local + global with eventual sync.

Local vs global rate limiting?

Local is fast (in-process) but each instance allows the limit independently — a fleet of 100 servers allows 100× the limit. Global is exact but requires coordination. Hybrid: local with periodic global sync.

Design a Distributed Rate Limiter at Scale — Beyond Single-Redis in 2026

The simple Redis-backed rate limiter (Design a Rate Limiter) handles most apps. At Cloudflare / Stripe / Twitter scale, with millions of requests per second across many regions, you need different shapes. This post is the working playbook.

When single-Redis breaks

1M+ req/sec: Redis is the bottleneck.
Multi-region: a Redis in us-east-1 adds roughly 200 ms of round-trip latency to every check for users in Bangalore.
Hot-key contention: one popular API key accumulates ops on one Redis shard.

Solutions stack: shard, distribute, approximate.

Shard by key hash

key → hash → shard (one of N Redis instances)

Each shard handles 1/N of the keys and runs the same Lua script for token-bucket logic. Add shards as load grows.

Limitation: a single hot key still hits one shard.
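The key → hash → shard mapping above can be sketched as follows. This is a minimal illustration, not a production router: the shard addresses and the function names shard_for and shard_address are assumptions made up for the example.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a rate-limit key to a shard index in [0, num_shards)."""
    # Use a stable hash: Python's built-in hash() is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical shard addresses, for illustration only.
SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379", "redis-3:6379"]

def shard_address(key: str) -> str:
    """Return the address of the shard that owns this key."""
    return SHARDS[shard_for(key, len(SHARDS))]
```

One design note: plain modulo hashing remaps most keys whenever N changes, so counters effectively reset when you add a shard. Consistent hashing reduces that churn if it matters for your limits.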

Hybrid local + global

Each app server has an in-memory rate limiter. It allows requests up to a fraction of the global limit. Periodically (e.g., every 100ms) it syncs with the global Redis and rebalances its local quota.

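import time

class HybridLimiter:  # one instance per rate-limited key
    def __init__(self, num_servers: int, fetch_remaining):
        self.num_servers = num_servers
        # Assumption: fetch_remaining is an async callable (key, used) -> int that
        # reports local usage to the global store and returns the tokens left.
        self.fetch_remaining = fetch_remaining
        self.local_count = 0               # requests admitted since the last sync
        self.last_sync = float("-inf")     # forces a sync on the first call
        self.global_remaining = 0          # global budget as of the last sync

    async def sync(self, key: str) -> None:
        self.global_remaining = await self.fetch_remaining(key, self.local_count)
        self.local_count = 0
        self.last_sync = time.monotonic()

    async def allow(self, key: str) -> bool:
        if time.monotonic() - self.last_sync > 0.1:
            await self.sync(key)
        if self.local_count < self.global_remaining // self.num_servers:
            self.local_count += 1
            return True
        return False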

Tradeoffs:

Pro: sub-millisecond decisions in-process.
Con: up to local_burst × num_servers overshoot in the worst case, where local_burst is the number of requests one server can admit between syncs.

For non-billing rate limits (anti-abuse), the overshoot is acceptable. For billing limits, use exact global.
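To make the overshoot bound concrete, here is a small worked calculation. The function name and the example numbers are assumptions chosen for illustration; the formula is the local_burst × num_servers bound stated above.

```python
def worst_case_admitted(limit: int, local_burst: int, num_servers: int) -> int:
    """Upper bound on admitted requests if every server spends its full
    local burst before the global store hears about any of it."""
    return limit + local_burst * num_servers

# Illustrative numbers: a 10,000-request limit, 100 servers,
# each allowed to admit 50 requests between syncs.
bound = worst_case_admitted(limit=10_000, local_burst=50, num_servers=100)
print(bound)  # 15000
```

Shrinking the sync interval or the per-server burst tightens the bound, at the cost of more traffic to the global store.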

Sliding-window approximation

Exact sliding window stores every request timestamp — O(N) memory per key. Approximate with two counters:

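import time
import redis

r = redis.Redis()  # assumes a reachable Redis on the default host/port

def allow(key, limit, window_secs=60):
    now = time.time()
    bucket = int(now // window_secs)
    elapsed_in_bucket = (now % window_secs) / window_secs

    # GET returns bytes (or None), so convert before doing arithmetic.
    prev = int(r.get(f"rl:{key}:{bucket-1}") or 0)
    curr = int(r.get(f"rl:{key}:{bucket}") or 0)

    # Weight the previous bucket by the share of it still inside the sliding window.
    weighted = prev * (1 - elapsed_in_bucket) + curr
    if weighted >= limit:
        return False
    # Read-then-increment is not atomic; concurrent callers can slightly overshoot.
    r.incr(f"rl:{key}:{bucket}")
    r.expire(f"rl:{key}:{bucket}", window_secs * 2)
    return True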

O(1) memory per key. Smooth (no boundary bursts). Used by Cloudflare and others.
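A worked example of the weighting may help. The sketch below isolates the estimate as a pure function; the name weighted_count and the sample numbers are assumptions for illustration.

```python
def weighted_count(prev: int, curr: int, elapsed_fraction: float) -> float:
    """Estimate requests in the sliding window from two fixed-window counters.

    elapsed_fraction is how far we are into the current bucket (0.0 to 1.0);
    the previous bucket contributes only the part still inside the window.
    """
    return prev * (1.0 - elapsed_fraction) + curr

# 25% into the current minute: 80 requests last minute, 30 so far this minute.
estimate = weighted_count(prev=80, curr=30, elapsed_fraction=0.25)
print(estimate)  # 90.0
```

With a limit of 100 this request is admitted; with a limit of 90 it is rejected. The estimate assumes requests in the previous bucket were spread evenly, which is where the approximation comes from.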

Multi-region

Two architectures:

Region-local

Each region runs its own rate limiter cluster. The “global” limit is per_region × num_regions. Simple; each region answers fast; no cross-region traffic.

Many CDNs take this approach. The overshoot relative to a true global limit is usually acceptable.

Eventually-consistent global

Each region maintains a local count + sends deltas to a global aggregator. Aggregator broadcasts back. Allows tighter global limit at the cost of complexity.
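The delta-and-broadcast flow can be sketched in memory as below. This is a simplified model under stated assumptions: the class names GlobalAggregator and RegionLimiter are hypothetical, network transport is replaced by a direct method call, and window resets are omitted.

```python
from collections import defaultdict

class GlobalAggregator:
    """Sums per-key deltas reported by regions (in-memory stand-in)."""
    def __init__(self) -> None:
        self.totals = defaultdict(int)

    def apply(self, deltas: dict) -> dict:
        for key, n in deltas.items():
            self.totals[key] += n
        return dict(self.totals)  # snapshot "broadcast" back to the caller

class RegionLimiter:
    """Admits requests against the last known global total plus local pending."""
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.pending = defaultdict(int)  # admitted locally, not yet reported
        self.global_view = {}            # totals as of the last flush

    def allow(self, key: str) -> bool:
        seen = self.global_view.get(key, 0) + self.pending[key]
        if seen >= self.limit:
            return False
        self.pending[key] += 1
        return True

    def flush(self, aggregator: GlobalAggregator) -> None:
        # Send deltas, receive the merged totals, start a fresh delta batch.
        self.global_view = aggregator.apply(self.pending)
        self.pending = defaultdict(int)
```

Between flushes each region works from a stale view, so two regions can jointly exceed the limit until their next flush. That staleness is the price of keeping cross-region traffic off the request path.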

For 99% of services, region-local is the right call.

Fair queueing under saturation

When demand exceeds capacity, naive limiting punishes some users randomly. Stochastic Fair Queueing hashes tenants into a fixed set of queues and serves the queues round-robin, so no single tenant can monopolize capacity. Weighted Fair Queueing allocates capacity proportional to tier (free vs paid).

A modern AI gateway often does WFQ — paid users get more capacity than free.
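As a rough sketch of what "capacity proportional to tier" means, the function below computes a weighted max-min fair split. It is not a packet-level WFQ scheduler; it computes the fluid allocation that WFQ approximates. The function name, tenant names, and weights are assumptions for illustration.

```python
def weighted_fair_share(capacity: float, demands: dict, weights: dict) -> dict:
    """Split capacity by weight, never giving a tenant more than it demands.

    Capacity left over by tenants with small demands is redistributed
    among the rest, again in proportion to weight.
    """
    alloc = {t: 0.0 for t in demands}
    active = {t for t, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[t] for t in active)
        share = {t: remaining * weights[t] / total_weight for t in active}
        # Tenants whose weighted share covers their outstanding demand.
        satisfied = {t for t in active if share[t] >= demands[t] - alloc[t]}
        if not satisfied:
            for t in active:
                alloc[t] += share[t]
            break
        for t in satisfied:
            remaining -= demands[t] - alloc[t]
            alloc[t] = float(demands[t])
        active -= satisfied
    return alloc

# Both tiers want 80 req/s, capacity is 100, paid is weighted 3x.
print(weighted_fair_share(100, {"free": 80, "paid": 80}, {"free": 1, "paid": 3}))
# {'free': 25.0, 'paid': 75.0}
```

If the free tier only asks for 10, it gets 10 and the paid tier receives the remaining 90, so unused share is not wasted.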

Common mistakes at scale

1. Single global Redis

Won’t scale past ~1M req/sec. Shard.

2. Per-region exact

Enforcing an exact count across regions means cross-region traffic for every check. Latency disaster. Use region-local limits instead.

3. Cache key without tenant

rl:1.2.3.4 is fine until a corporate NAT hits — 10k users behind one IP. Use tenant + IP.
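A key builder along these lines keeps tenants behind a shared IP separate. The function name rate_limit_key and the "anon" placeholder for unauthenticated traffic are assumptions for this sketch.

```python
from typing import Optional

def rate_limit_key(tenant_id: Optional[str], client_ip: str) -> str:
    """Build a rate-limit key from tenant + IP.

    Unauthenticated traffic has no tenant, so it falls back to a shared
    "anon" bucket per IP (an assumption for this sketch).
    """
    tenant = tenant_id if tenant_id else "anon"
    return f"rl:{tenant}:{client_ip}"

print(rate_limit_key("acme", "1.2.3.4"))  # rl:acme:1.2.3.4
print(rate_limit_key(None, "1.2.3.4"))    # rl:anon:1.2.3.4
```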

4. No graceful degrade

Rate limiter goes down → fail open or fail closed? Decide consciously.
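One way to make that decision explicit is to wrap the limiter call and pass the policy in as a parameter. This is a minimal sketch; guarded_allow and the check callable are hypothetical names, and a real deployment would likely also log and emit a metric on the failure path.

```python
from typing import Callable

def guarded_allow(check: Callable[[str], bool], key: str, fail_open: bool) -> bool:
    """Run the limiter check; on backend failure fall back to the chosen policy."""
    try:
        return check(key)
    except Exception:
        # Backend unavailable: admit the request (fail open)
        # or reject it (fail closed), as configured by the caller.
        return fail_open

def broken_backend(key: str) -> bool:
    """Stand-in for a limiter whose Redis is unreachable."""
    raise ConnectionError("limiter backend unreachable")

print(guarded_allow(broken_backend, "acme", fail_open=True))   # True
print(guarded_allow(broken_backend, "acme", fail_open=False))  # False
```

Anti-abuse limits often fail open so an outage in the limiter does not become an outage in the product, while limits that protect billing or a fragile downstream tend to fail closed.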

5. Hot-key without sharding

One API key maps to one Redis shard, so a hot key saturates that shard no matter how many shards you run. Sub-shard hot keys.
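Sub-sharding can be sketched as splitting one hot key into K sub-keys, each enforcing 1/K of the limit, so the sub-keys can hash to different shards. The "#" suffix format and both function names are assumptions for illustration.

```python
import random

def sub_shard_key(key: str, num_sub_shards: int) -> str:
    """Pick one of K sub-keys at random so load spreads across shards."""
    return f"{key}#{random.randrange(num_sub_shards)}"

def sub_shard_limit(limit: int, num_sub_shards: int) -> int:
    """Each sub-key enforces an equal slice of the original limit."""
    return max(1, limit // num_sub_shards)

# A 1,000 req/min key split 8 ways: each sub-key allows 125 req/min.
print(sub_shard_limit(1000, 8))  # 125
```

The trade-off is accuracy: a request can be rejected because its sub-key is full while other sub-keys still have headroom, so the effective limit is approximate.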
