A rate limiter looks simple — “allow N requests per minute.” It isn’t. The classic interview question hides a forest of choices: which algorithm, where the state lives, what to do during partitions, how to be fair, how to handle hot keys. Here’s the complete design.
What we’re building
A distributed rate limiter that, given a key (user ID, IP, API key) and a policy (e.g., “100 requests per minute”), decides allow / deny in <5ms p99, across many app servers, without becoming a single point of failure.
Where it sits
client → CDN/edge → API gateway ── (rate limit check) ── app servers
\
Redis / Valkey ←── shared state
Two common placements:
- At the gateway (Envoy, Kong, Cloudflare Workers, NGINX with
ngx_http_limit_req_module). Cheap to defend the whole fleet. - In the app (e.g., per-endpoint policies that need user context the gateway doesn’t know).
Most production systems use both: gateway for crude DDoS-style limits, app for fine-grained, business-aware limits.
The four algorithms
1. Fixed window counter
window = 12:00:00–12:00:59
counter[user] = 0
on request: if counter[user] < limit: allow, counter[user]++
on minute roll: reset counter
- Pros: Trivial, O(1).
- Cons: Burst at window boundary. A user can do 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in 1 second.
2. Sliding window log
Store the timestamp of every request in a sorted set. On each request, drop timestamps older than the window, count remaining, allow if count < limit.
- Pros: Exact. No boundary burst.
- Cons: O(log n) memory per user. A user making 10k req/min costs you 10k timestamps. Memory cost grows with traffic.
3. Sliding window counter (the practical one)
Approximate sliding window with two counters: current minute and previous minute, weighted.
elapsed = (now - window_start) / window_size # 0 to 1
count = previous * (1 - elapsed) + current
allow if count < limit
- Pros: Constant space (2 counters per user). Smooth — no boundary bursts.
- Cons: It’s an approximation. Acceptable accuracy for almost any real use case.
4. Token bucket (the gold standard)
A bucket holds up to B tokens. Tokens regenerate at rate R/sec. Each request consumes one token. If empty, deny.
tokens[user] = min(B, tokens[user] + R * (now - last_refill[user]))
last_refill[user] = now
if tokens[user] >= 1:
tokens[user] -= 1
allow
else:
deny
- Pros: Allows controlled bursts up to
B. Smooth steady-state. Composable (different buckets per scope). - Cons: Two state values per user. Slightly more complex.
My pick for new systems: token bucket. The ability to allow bursts (humans don’t request at exact rates) makes it feel right to users without breaking the limit on aggregate.
Leaky bucket — adjacent
A queue of fixed length; requests drain at fixed rate. Used in network shaping more than API rate limiting; mentioned for completeness.
Redis implementation (token bucket, atomic)
The naive token-bucket pseudo-code above has a race: two app servers could both read tokens=1, both decrement, both allow, leaving tokens=-1. The fix is a Lua script — Redis executes Lua atomically.
-- KEYS[1]: bucket key
-- ARGV[1]: max tokens (B)
-- ARGV[2]: refill rate (tokens/sec)
-- ARGV[3]: now (unix ms)
-- ARGV[4]: tokens to consume (usually 1)
-- Returns: {allowed (1/0), remaining_tokens, retry_after_ms}
local key = KEYS[1]
local cap = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local data = redis.call("HMGET", key, "tokens", "last_ms")
local tokens = tonumber(data[1])
local last_ms = tonumber(data[2])
if tokens == nil then
tokens = cap
last_ms = now
end
-- refill
local elapsed_ms = math.max(0, now - last_ms)
tokens = math.min(cap, tokens + (rate * elapsed_ms / 1000))
local allowed = 0
local retry_after = 0
if tokens >= cost then
tokens = tokens - cost
allowed = 1
else
-- ms until enough tokens accumulate
retry_after = math.ceil((cost - tokens) / rate * 1000)
end
redis.call("HMSET", key, "tokens", tokens, "last_ms", now)
redis.call("PEXPIRE", key, math.ceil(cap / rate * 1000) * 2)
return {allowed, math.floor(tokens), retry_after}
Calling it from Python with aioredis:
SCRIPT = open("token_bucket.lua").read()
class RateLimiter:
def __init__(self, redis, cap, rate):
self.redis = redis
self.cap = cap
self.rate = rate
self._sha = None
async def _load(self):
self._sha = await self.redis.script_load(SCRIPT)
async def check(self, key: str, cost: int = 1) -> tuple[bool, int, int]:
if self._sha is None:
await self._load()
now_ms = int(time.time() * 1000)
try:
allowed, remaining, retry_after = await self.redis.evalsha(
self._sha, 1, key, self.cap, self.rate, now_ms, cost,
)
except aioredis.NoScriptError:
await self._load()
return await self.check(key, cost)
return bool(allowed), remaining, retry_after
Three production details:
PEXPIRElets idle keys reclaim memory. Without it, memory grows forever.evalsha+ retry onNOSCRIPTsurvives Redis restarts (the script cache flushes).- Lua-side time would be cleaner, but Redis’s Lua time isn’t available everywhere — pass
now_msfrom the app.
This implementation does the limit check in one round trip at <1ms typical latency.
HTTP shape
When you deny, the response should be helpful:
HTTP/1.1 429 Too Many Requests
Retry-After: 12
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1717920660
Content-Type: application/json
{"error": "rate_limited", "message": "Try again in 12s"}
Retry-After(in seconds) is the standard header. Clients respect it.- The
X-RateLimit-*headers tell well-behaved clients how to pace. - Some APIs also send these headers on successful responses so clients can self-throttle before they hit a 429.
Distributing the limiter
A single Redis is a single point of failure. Three deployment patterns:
1. Single Redis (or Redis cluster)
The simplest. Pin all rate limiter keys to one Redis (or Redis Cluster). For most services, one well-sized Redis can do hundreds of thousands of ops/sec.
2. Sharded by key
Hash key → shard. Each shard is a separate Redis. Linear scale-out. Trivial to operate. Pitfall: a hot key concentrates load on one shard.
3. Local + global hybrid
Each app server has an in-memory limiter that handles 90% of traffic with a relaxed local count. Periodically sync to a global Redis to enforce the global limit and rebalance.
- Pros: Massive throughput. Low latency (in-memory).
- Cons: Up to a small overcount (each server can let through
local_burstextra requests until sync). Acceptable for non-billing limits.
For most APIs, single Redis or sharded Redis is plenty. Reach for hybrid only when you’ve measured an actual problem.
What about Redis going down?
You have three policy choices when the limiter store is unavailable:
- Fail open (allow). Best UX, worst safety.
- Fail closed (deny). Safe, awful UX during incidents.
- Fail static (use a fallback in-memory limiter on each node). The pragmatic middle.
I default to fail static: each app server keeps a coarse local limiter that activates when Redis errors persist for >5s. Logs a warning, paged on duration. Users see degradation, not outages.
Hot-key hazards
What if 10M users all share one key (e.g., ip:1.2.3.4 because they’re behind a corporate NAT)?
- Avoid IP-only keys for authenticated traffic. Prefer
user_id+ IP. - Multi-tier: a coarse global limit per IP, a finer limit per user. The IP one allows for shared NATs.
- Sharding within a single key: pick a “sub-shard” by hashing request ID, increment that, check sum periodically.
Fairness
A naive limiter is first-come-first-served — any single user can saturate. Weighted Fair Queuing (WFQ) gives bandwidth proportional to weight. Stochastic Fair Queuing (SFQ) randomizes to avoid starvation.
For most APIs you don’t need WFQ — per-user buckets are fair, in proportion to (limit × number of users). Reach for fairness algorithms when you have a multi-tenant system where one tenant’s bursts shouldn’t degrade another’s.
What’s surprising about deploying this
- Time can move backwards during NTP sync. Use monotonic time inside the app, server-supplied wall time only as a tiebreaker.
- Network round trips matter — at 1M req/s with a 1ms RTT to Redis, that’s 1000 simultaneous connections worth of pending I/O. Pool aggressively.
- TTL handling is where memory leaks live. Always set them.
- Test with chaos — kill Redis mid-traffic and see what your service does. The first test is usually surprising.
Where to put policy
The boring but important question. Three options:
- In code: a hash map of
endpoint → limit. Easy, requires deploy to change. - In a config file: same, but reloadable.
- In a control plane: a config service the gateway polls. Slowest to build, fastest to change limits in production.
Build the simplest first. Migrate when “I need to tighten this limit now” becomes a recurring incident response step.
Read this next
- Rate Limiting Strategies for APIs — production-side patterns.
- Designing REST APIs That Don’t Suck — what the 429 response should look like.
- Distributed Systems Fundamentals — the model behind the design.
If you want a working Redis-backed token-bucket library + middleware for FastAPI, Express, and Axum, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .