The simple Redis-backed rate limiter (Design a Rate Limiter) handles most apps. At Cloudflare, Stripe, or Twitter scale, with millions of requests per second across many regions, you need different shapes. This post is the working playbook.

When single-Redis breaks

  • 1M+ req/sec: a single Redis instance becomes the bottleneck.
  • Multi-region: a Redis in us-east-1 adds roughly 200ms of round-trip latency for users in Bangalore.
  • Hot-key contention: one popular API key concentrates all of its operations on one Redis shard.

Solutions stack: shard, distribute, approximate.

Shard by key hash

key → hash → shard (one of N Redis instances)

Each shard handles roughly 1/N of the keys and runs its own Lua script for the token-bucket logic. Add shards as load grows.
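The token-bucket logic would normally live in a Lua script so that it runs atomically inside Redis. As a rough illustration only, here is the same refill-then-spend arithmetic in plain Python. The class name, its fields, and the explicit `now` argument are assumptions made for this sketch, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """In-memory stand-in for the per-key state a Lua script would keep."""
    capacity: float          # maximum burst size
    refill_per_sec: float    # sustained rate
    tokens: float = 0.0      # tokens currently available
    updated_at: float = 0.0  # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        # Spend tokens only if enough are available.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing `now` in explicitly keeps the sketch deterministic; a real script would read the clock on the Redis side or take it as an argument from the caller.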

Limitation: a single hot key still hits one shard.
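A minimal sketch of the key → hash → shard step, assuming a fixed number of shards and plain modulo placement. The function name `shard_for` is made up for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a rate-limit key to a shard index using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process and could route the same key differently on different servers.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The same key always returns the same index, which is exactly the hot-key limitation. Plain modulo also remaps most keys whenever `num_shards` changes; consistent hashing is the usual way to soften that.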

Hybrid local + global

Each app server has an in-memory rate limiter. It allows requests up to a fraction of the global limit. Periodically (e.g., every 100ms) it syncs with the global Redis and rebalances its local quota.

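import time

class HybridLimiter:
    """One instance per rate-limited key on each app server."""

    def __init__(self, num_servers: int):
        self.num_servers = num_servers
        self.local_count = 0            # admitted locally since the last sync
        self.last_sync = float("-inf")  # forces a sync on the first call
        self.global_remaining = 0       # global budget as of the last sync

    async def sync(self, key: str) -> None:
        # Report local_count to the global Redis, refresh global_remaining,
        # and reset local_count. Left abstract in this sketch.
        raise NotImplementedError

    async def allow(self, key: str) -> bool:
        now = time.monotonic()
        if now - self.last_sync > 0.1:
            await self.sync(key)
            self.last_sync = now
        if self.local_count < self.global_remaining // self.num_servers:
            self.local_count += 1
            return True
        return False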

Tradeoffs:

  • Pro: sub-millisecond decisions in-process.
  • Con: up to local_burst × num_servers overshoot in the worst case, where local_burst is the most one server can admit between syncs.

For non-billing rate limits (anti-abuse), the overshoot is acceptable. For billing limits, use an exact global counter.
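To put numbers on that worst case, here is a quick calculation. The fleet size and per-server burst are made-up values, and the helper name is hypothetical.

```python
def worst_case_overshoot(local_burst: int, num_servers: int) -> int:
    """Upper bound on extra requests admitted between syncs."""
    return local_burst * num_servers


# Hypothetical fleet: 50 app servers, each able to admit 20 requests between syncs.
print(worst_case_overshoot(local_burst=20, num_servers=50))  # 1000
```

Against an anti-abuse limit of, say, 100k requests per minute, an extra 1000 is noise. Against a metered billing quota it is not.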

Sliding-window approximation

Exact sliding window stores every request timestamp — O(N) memory per key. Approximate with two counters:

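import time
import redis

r = redis.Redis()  # assumes a reachable Redis; adjust host/port as needed

def allow(key, limit, window_secs=60):
    now = time.time()
    bucket = int(now // window_secs)
    elapsed_in_bucket = (now % window_secs) / window_secs

    # GET returns bytes (or None), so convert before doing arithmetic.
    prev = int(r.get(f"rl:{key}:{bucket - 1}") or 0)
    curr = int(r.get(f"rl:{key}:{bucket}") or 0)

    # Count only the part of the previous window that still overlaps the sliding window.
    weighted = prev * (1 - elapsed_in_bucket) + curr
    if weighted >= limit:
        return False
    # Read-then-increment is not atomic: concurrent callers can overshoot slightly.
    # Move the whole check into a Lua script if that matters.
    r.incr(f"rl:{key}:{bucket}")
    r.expire(f"rl:{key}:{bucket}", window_secs * 2)
    return True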

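O(1) memory per key. Much smoother than fixed windows: the estimate assumes requests in the previous window were evenly spread, so boundary bursts are damped rather than eliminated. Used by Cloudflare and others.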
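A worked example of the two-counter weighting, pulled out into a pure function so the arithmetic is easy to check. The function name and the sample counts are illustrative assumptions.

```python
def weighted_count(prev: int, curr: int, elapsed_fraction: float) -> float:
    """Estimate requests in the trailing window from two fixed-window counters."""
    if not 0.0 <= elapsed_fraction <= 1.0:
        raise ValueError("elapsed_fraction must be between 0 and 1")
    return prev * (1.0 - elapsed_fraction) + curr


# 25% into the current window, so 75% of the previous window still overlaps.
estimate = weighted_count(prev=80, curr=30, elapsed_fraction=0.25)
print(estimate)         # 90.0
print(estimate >= 100)  # False: a limit of 100 would still admit the request
```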

Multi-region

Two architectures:

Region-local

Each region runs its own rate limiter cluster. The “global” limit is per_region × num_regions. Simple; each region answers fast; no cross-region traffic.

Most CDNs do this. The slight overshoot is acceptable.

Eventually-consistent global

Each region maintains a local count and sends deltas to a global aggregator, which merges them and broadcasts the totals back. This allows a tighter global limit at the cost of complexity.
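The delta-and-broadcast flow can be sketched in memory as follows. This is a toy model under stated assumptions: the class and method names are invented, there is no network, and the "broadcast" is a plain method call.

```python
from collections import defaultdict


class GlobalAggregator:
    """Sums per-region deltas into one global count per key."""

    def __init__(self) -> None:
        self.totals: dict[str, int] = defaultdict(int)

    def apply_deltas(self, deltas: dict[str, int]) -> dict[str, int]:
        for key, delta in deltas.items():
            self.totals[key] += delta
        # Snapshot of the merged totals; a real system would broadcast this.
        return dict(self.totals)


class RegionCounter:
    """Decides locally from unreported counts plus the last known global total."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.pending: dict[str, int] = defaultdict(int)       # not yet reported
        self.known_global: dict[str, int] = defaultdict(int)  # last broadcast seen

    def allow(self, key: str) -> bool:
        if self.known_global[key] + self.pending[key] >= self.limit:
            return False
        self.pending[key] += 1
        return True

    def receive(self, snapshot: dict[str, int]) -> None:
        self.known_global = defaultdict(int, snapshot)

    def flush(self, aggregator: GlobalAggregator) -> None:
        # Send local deltas, then adopt the merged totals that come back.
        snapshot = aggregator.apply_deltas(dict(self.pending))
        self.pending.clear()
        self.receive(snapshot)
```

Between flushes each region is blind to the others, so two regions can jointly exceed the limit until the next broadcast arrives. The flush interval is the knob that trades accuracy against cross-region traffic.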

For 99% of services, region-local is the right call.

Fair queueing under saturation

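When demand exceeds capacity, naive limiting punishes some users arbitrarily. Stochastic Fair Queueing hashes each tenant into one of a fixed number of queues and serves those queues round-robin, so tenants get roughly equal shares without per-tenant state. Weighted Fair Queueing allocates capacity in proportion to a weight, such as tier (free vs paid).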

A modern AI gateway often does WFQ — paid users get more capacity than free.
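As an illustration of "capacity proportional to tier", here is a weighted max-min split of a fixed capacity. To be clear, this is not a packet-level WFQ scheduler with virtual finish times; it only computes the steady-state shares such a scheduler aims for. The function name, tenant names, and weights are assumptions, and weights are assumed to be positive.

```python
def weighted_shares(capacity, demand, weight):
    """Split capacity by weight; no tenant receives more than it asked for."""
    alloc = {tenant: 0.0 for tenant in demand}
    active = {tenant for tenant, d in demand.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 0:
        total_weight = sum(weight[t] for t in active)
        share = {t: remaining * weight[t] / total_weight for t in active}
        # Tenants whose demand fits inside their share are fully satisfied.
        satisfied = {t for t in active if demand[t] <= share[t]}
        if not satisfied:
            # Everyone left wants more than their share: split the rest by weight.
            for t in active:
                alloc[t] = share[t]
            break
        for t in satisfied:
            alloc[t] = float(demand[t])
            remaining -= demand[t]
        active -= satisfied
    return alloc


print(weighted_shares(100, {"free": 80, "paid": 80}, {"free": 1, "paid": 3}))
# {'free': 25.0, 'paid': 75.0}
```

When the free tier asks for less than its share, the unused capacity flows to the paid tier instead of being wasted.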

Common mistakes at scale

1. Single global Redis

Won’t scale past ~1M req/sec. Shard.

2. Exact global counts across regions

Cross-region traffic for every check is a latency disaster. Go region-local.

3. Rate-limit key without tenant

rl:1.2.3.4 is fine until a corporate NAT shows up with 10k users behind one IP. Key on tenant + IP.
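A tiny sketch of a tenant-scoped key. The helper name and the `rl:tenant:ip` layout are assumptions; use whatever separator your identifiers cannot contain.

```python
def rate_limit_key(tenant_id: str, client_ip: str) -> str:
    """Scope the bucket to tenant and IP so tenants behind one NAT stay separate."""
    return f"rl:{tenant_id}:{client_ip}"


print(rate_limit_key("acme", "1.2.3.4"))  # rl:acme:1.2.3.4
```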

4. No graceful degrade

Rate limiter goes down → fail open or fail closed? Decide consciously.
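One way to make that decision explicit in code is a wrapper that takes the policy as a parameter. This is a sketch; `check_with_fallback` and its arguments are hypothetical names.

```python
from typing import Callable


def check_with_fallback(check: Callable[[], bool], fail_open: bool) -> bool:
    """Run a limiter check; if the limiter itself errors, apply the chosen policy."""
    try:
        return check()
    except Exception:
        # fail_open=True admits traffic while the limiter is unavailable;
        # fail_open=False rejects it.
        return fail_open
```

Anti-abuse limits usually lean toward failing open so a limiter outage does not become a site outage, while limits that protect a fragile backend may prefer failing closed. In real code, catch the specific connection and timeout errors of your client rather than `Exception`.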

5. Hot-key without sharding

One API key = one Redis shard. Sub-shard hot keys.
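Sub-sharding can be as simple as appending a random suffix so one logical key becomes several physical keys, each enforcing a slice of the limit. The helper names and the even split are assumptions for this sketch.

```python
import random


def sub_shard_key(key: str, fanout: int) -> str:
    """Pick one of `fanout` sub-keys at random so a hot key spreads across shards."""
    return f"{key}:{random.randrange(fanout)}"


def sub_shard_limit(limit: int, fanout: int) -> int:
    """Each sub-key enforces an equal slice of the original limit."""
    return max(1, limit // fanout)
```

Because requests land on sub-keys at random, one sub-key can fill up before the others, so the effective limit becomes approximate. That is usually an acceptable trade for removing the single-shard hotspot.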

Read this next

If you want a hybrid local/global rate-limiter library, it’s at rajpoot.dev.

