The simple Redis-backed rate limiter (see Design a Rate Limiter) handles most apps. At Cloudflare, Stripe, or Twitter scale, with millions of requests per second across many regions, you need different shapes. This post is the working playbook.
When single-Redis breaks
- 1M+ req/sec: Redis is the bottleneck.
- Multi-region: a Redis in us-east-1 adds 200ms latency for users in Bangalore.
- Hot-key contention: one popular API key accumulates ops on one Redis shard.
Solutions stack: shard, distribute, approximate.
Shard by key hash
key → hash → shard (one of N Redis instances)
Each shard handles roughly 1/N of the keys and runs its own Lua script for the token-bucket logic. Add shards as load grows.
Limitation: a single hot key still hits one shard.
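The key → hash → shard mapping can be sketched as below. This is a minimal illustration, not a prescribed implementation: the function names and the shard addresses are made up for the example, and SHA-256 is just one stable hash that works.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a rate-limit key to a shard index in [0, num_shards)."""
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so two app servers would disagree on the shard for the same key.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical shard addresses, for illustration only.
SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379", "redis-3:6379"]

def shard_address(key: str) -> str:
    """Return the address of the Redis instance that owns this key."""
    return SHARDS[shard_for(key, len(SHARDS))]
```

Plain modulo remaps most keys whenever N changes. For rate-limit counters that usually just means a brief counter reset, but consistent hashing is a common alternative if that matters.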
Hybrid local + global
Each app server has an in-memory rate limiter. It allows requests up to a fraction of the global limit. Periodically (e.g., every 100ms) it syncs with the global Redis and rebalances its local quota.
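    import time

    class HybridLimiter:
        def __init__(self, num_servers: int):
            self.num_servers = num_servers
            self.local_count = 0          # requests admitted locally since the last sync
            self.last_sync = 0.0          # time of the last sync with the global Redis
            self.global_remaining = 0     # global budget left, as of the last sync

        async def allow(self, key: str) -> bool:
            # Resync at most every 100 ms. sync() is assumed to report local_count,
            # refresh global_remaining, and reset local_count and last_sync.
            if time.time() - self.last_sync > 0.1:
                await self.sync(key)
            # Admit locally while under this server's equal share of the remaining budget.
            if self.local_count < self.global_remaining // self.num_servers:
                self.local_count += 1
                return True
            return False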
Tradeoffs:
- Pro: sub-millisecond decisions in-process.
- Con: up to local_burst × num_servers overshoot in the worst case.
For non-billing rate limits (anti-abuse), the overshoot is acceptable. For billing limits, use exact global.
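The class above leaves `sync` undefined. One possible shape is sketched below, under loud assumptions: it is synchronous for brevity, `FakeGlobalStore` is an in-memory stand-in for the shared Redis counter, time is passed in explicitly so the behavior is reproducible, and window resets are left out entirely.

```python
import time
from typing import Optional

class FakeGlobalStore:
    """In-memory stand-in for the shared Redis counter (illustration only)."""
    def __init__(self):
        self.used = {}

    def incrby(self, key: str, amount: int) -> int:
        self.used[key] = self.used.get(key, 0) + amount
        return self.used[key]

class HybridLimiter:
    def __init__(self, store, limit: int, num_servers: int, sync_interval: float = 0.1):
        self.store = store
        self.limit = limit                # global limit for the current window
        self.num_servers = num_servers
        self.sync_interval = sync_interval
        self.local_count = 0              # admitted locally since the last sync
        self.local_quota = 0              # what this server may admit until the next sync
        self.last_sync = float("-inf")    # forces a sync on the first call

    def sync(self, key: str, now: float) -> None:
        # Report local usage, learn global usage, take an equal share of what is left.
        used = self.store.incrby(key, self.local_count)
        remaining = max(0, self.limit - used)
        self.local_quota = remaining // self.num_servers
        self.local_count = 0
        self.last_sync = now

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if now - self.last_sync > self.sync_interval:
            self.sync(key, now)
        if self.local_count < self.local_quota:
            self.local_count += 1
            return True
        return False
```

In this sketch a server that syncs before its peers have reported still claims a share of a budget that is already partly spent. That stale view is where the overshoot comes from.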
Sliding-window approximation
Exact sliding window stores every request timestamp — O(N) memory per key. Approximate with two counters:
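    import time
    from redis import Redis

    redis = Redis()  # assumes a reachable Redis on the default host/port

    def allow(key, limit, window_secs=60):
        now = time.time()
        bucket = int(now // window_secs)                       # index of the current fixed window
        elapsed_in_bucket = (now % window_secs) / window_secs  # fraction of it already elapsed
        # GET returns bytes (or None for a missing key), so convert before doing arithmetic.
        prev = int(redis.get(f"rl:{key}:{bucket-1}") or 0)
        curr = int(redis.get(f"rl:{key}:{bucket}") or 0)
        # Weight the previous window by how much of it still overlaps the sliding window.
        weighted = prev * (1 - elapsed_in_bucket) + curr
        if weighted >= limit:
            return False
        # The check and the increment are separate commands, so concurrent callers can
        # slightly overshoot; move both into a Lua script if that matters.
        redis.incr(f"rl:{key}:{bucket}")
        redis.expire(f"rl:{key}:{bucket}", window_secs * 2)
        return True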
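O(1) memory per key, and no bursts at window boundaries, because the previous window's count fades out linearly instead of dropping to zero. It is an approximation: it assumes requests in the previous window were evenly spread. Used by Cloudflare and others.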
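To make the weighting concrete, here is the same formula as a pure function with two worked cases. The function name and the numbers are illustrative only.

```python
def weighted_count(prev: int, curr: int, elapsed_fraction: float) -> float:
    """Estimate requests in the trailing window from two fixed-window counters."""
    return prev * (1.0 - elapsed_fraction) + curr

# 15 s into a 60 s window: 25% elapsed, so 75% of the previous window still counts.
estimate = weighted_count(prev=80, curr=30, elapsed_fraction=15 / 60)
print(estimate)   # 90.0, under a limit of 100, so the request is allowed

# Right at a window boundary the previous window still counts in full,
# so a client that just used its whole limit cannot burst again immediately.
at_boundary = weighted_count(prev=100, curr=0, elapsed_fraction=0.0)
print(at_boundary)   # 100.0, at the limit, so the request is rejected
```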
Multi-region
Two architectures:
Region-local
Each region runs its own rate limiter cluster. The “global” limit is per_region × num_regions. Simple; each region answers fast; no cross-region traffic.
Most CDNs do this. The slight overshoot is acceptable.
Eventually-consistent global
Each region maintains a local count and sends deltas to a global aggregator, which broadcasts the merged totals back to every region. This allows a tighter global limit at the cost of complexity.
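The delta-and-broadcast flow might look roughly like the sketch below. Everything here is an assumption made for illustration: the class names, the in-memory aggregator standing in for a real service, and the "broadcast" modeled as the return value of the flush call.

```python
from collections import defaultdict
from typing import Dict

class GlobalAggregator:
    """Merges per-region deltas into one global count per key (in-memory sketch)."""
    def __init__(self):
        self.totals = defaultdict(int)

    def apply_deltas(self, deltas: Dict[str, int]) -> Dict[str, int]:
        for key, delta in deltas.items():
            self.totals[key] += delta
        # "Broadcast": return the merged totals for the keys just reported.
        return {key: self.totals[key] for key in deltas}

class RegionLimiter:
    def __init__(self, aggregator: GlobalAggregator, limit: int):
        self.aggregator = aggregator
        self.limit = limit
        self.pending = defaultdict(int)      # admitted locally, not yet reported
        self.global_seen = defaultdict(int)  # last global total heard from the aggregator

    def allow(self, key: str) -> bool:
        # Decide from local knowledge only: no cross-region call on the hot path.
        if self.global_seen[key] + self.pending[key] >= self.limit:
            return False
        self.pending[key] += 1
        return True

    def flush(self) -> None:
        # Called periodically, for example from a background task.
        if not self.pending:
            return
        merged = self.aggregator.apply_deltas(dict(self.pending))
        self.global_seen.update(merged)
        self.pending.clear()
```

Between flushes each region works from a stale total, so the global limit can still be exceeded by roughly what the other regions admit in one flush interval.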
For 99% of services, region-local is the right call.
Fair queueing under saturation
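When demand exceeds capacity, naive limiting rejects whichever requests happen to arrive after the limit is hit, so which users get punished is effectively random. Stochastic Fair Queueing hashes tenants into a fixed set of queues and serves those queues round-robin, so no single tenant can monopolize capacity. Weighted Fair Queueing allocates capacity proportional to tier (free vs paid).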
A modern AI gateway often does WFQ — paid users get more capacity than free.
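A minimal sketch of weighted fair queueing follows. It uses self-clocked finish tags, a common simplification of WFQ, rather than a full scheduler; the class name, the tier names, and the 3:1 weights are assumptions for the example.

```python
import heapq
import itertools

class WeightedFairQueue:
    def __init__(self, weights):
        self.weights = weights            # tier -> weight (higher = larger share)
        self.last_finish = {}             # tier -> finish tag of its latest enqueued request
        self.virtual_time = 0.0
        self.heap = []
        self.counter = itertools.count()  # tie-breaker so equal tags stay FIFO

    def enqueue(self, tier, request, cost=1.0):
        # A request starts when the tier's previous one finishes, or now if it was idle.
        start = max(self.virtual_time, self.last_finish.get(tier, 0.0))
        finish = start + cost / self.weights[tier]
        self.last_finish[tier] = finish
        heapq.heappush(self.heap, (finish, next(self.counter), tier, request))

    def dequeue(self):
        # Serve the request with the smallest finish tag.
        finish, _, tier, request = heapq.heappop(self.heap)
        self.virtual_time = finish        # self-clocked: advance to the served tag
        return tier, request

queue = WeightedFairQueue({"paid": 3, "free": 1})
for i in range(6):
    queue.enqueue("free", f"free-{i}")
for i in range(6):
    queue.enqueue("paid", f"paid-{i}")
first_eight = [queue.dequeue()[0] for _ in range(8)]
```

With these weights, while both tiers have a backlog, paid requests are served about three times as often as free ones, even though the free requests were enqueued first.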
Common mistakes at scale
1. Single global Redis
Won’t scale past ~1M req/sec. Shard.
2. Per-region exact
Checking an exact global count means cross-region traffic on every request. Latency disaster. Go region-local.
3. Rate-limit key without tenant
rl:1.2.3.4 is fine until a corporate NAT puts 10k users behind one IP. Use tenant + IP.
4. No graceful degrade
Rate limiter goes down → fail open or fail closed? Decide consciously.
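One way to make that decision explicit is a thin wrapper around the limiter call, sketched here. The function name, its parameters, and the logger name are hypothetical; `check` stands for whatever limiter function you already have.

```python
import logging
from typing import Callable

logger = logging.getLogger("ratelimit")

def allow_with_fallback(check: Callable[[str], bool], key: str, fail_open: bool) -> bool:
    """Run a limiter check; if the limiter itself errors, apply an explicit policy."""
    try:
        return check(key)
    except Exception:
        # The limiter backend (for example Redis) is unreachable or timed out.
        logger.exception("rate limiter unavailable for key=%s; fail_open=%s", key, fail_open)
        return fail_open
```

A plausible split is `fail_open=True` for anti-abuse limits, where dropping all traffic is worse than briefly unmetered traffic, and `fail_open=False` for billing limits. The point is that the choice is a visible parameter instead of an accident of exception handling.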
5. Hot-key without sharding
One API key = one Redis shard. Sub-shard hot keys.
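Sub-sharding can be as simple as splitting one hot counter into several sub-counters that hash to different shards. The sketch below assumes a random pick per request and an equal split of the limit; the helper names are made up, and it assumes the limit is at least as large as the fanout.

```python
import random

def sub_key(key: str, fanout: int) -> str:
    """Pick one of `fanout` sub-counters for a hot key at random."""
    return f"{key}:{random.randrange(fanout)}"

def sub_limit(limit: int, fanout: int) -> int:
    """Each sub-counter enforces an equal slice of the overall limit."""
    return limit // fanout

# Example: a 10,000 req/window key spread over 8 sub-counters.
example_key = sub_key("rl:hot-api-key", 8)
example_limit = sub_limit(10_000, 8)
```

Each request then checks `example_key` against `example_limit` using the normal limiter. Because the pick is random, one sub-counter can fill slightly before the others, so the effective limit becomes approximate.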
Read this next
- Design a Rate Limiter — the single-Redis foundation.
- Caching Strategies in 2026
- Distributed Systems Fundamentals
- AI Gateways in 2026
If you want a hybrid local/global rate-limiter library, it’s at rajpoot.dev.