Design a Rate Limiter — Algorithms, Distributed Implementation, and the Tradeoffs

A rate limiter looks simple — “allow N requests per minute.” It isn’t. The classic interview question hides a forest of choices: which algorithm, where the state lives, what to do during partitions, how to be fair, how to handle hot keys. Here’s the complete design.

What we’re building

A distributed rate limiter that, given a key (user ID, IP, API key) and a policy (e.g., “100 requests per minute”), decides allow / deny in <5ms p99, across many app servers, without becoming a single point of failure.

Where it sits

client → CDN/edge → API gateway ── (rate limit check) ── app servers
                              \
                               Redis / Valkey  ←── shared state

Two common placements:

At the gateway (Envoy, Kong, Cloudflare Workers, NGINX with ngx_http_limit_req_module). Cheap to defend the whole fleet.
In the app (e.g., per-endpoint policies that need user context the gateway doesn’t know).

Most production systems use both: gateway for crude DDoS-style limits, app for fine-grained, business-aware limits.

The four algorithms

1. Fixed window counter

window = 12:00:00–12:00:59
counter[user] = 0
on request: if counter[user] < limit: allow, counter[user]++
on minute roll: reset counter

Pros: Trivial, O(1).
Cons: Burst at window boundary. A user can do 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in 1 second.

2. Sliding window log

Store the timestamp of every request in a sorted set. On each request, drop timestamps older than the window, count remaining, allow if count < limit.

Pros: Exact. No boundary burst.
Cons: O(log n) memory per user. A user making 10k req/min costs you 10k timestamps. Memory cost grows with traffic.

3. Sliding window counter (the practical one)

Approximate sliding window with two counters: current minute and previous minute, weighted.

elapsed = (now - window_start) / window_size  # 0 to 1
count = previous * (1 - elapsed) + current
allow if count < limit

Pros: Constant space (2 counters per user). Smooth — no boundary bursts.
Cons: It’s an approximation. Acceptable accuracy for almost any real use case.

4. Token bucket (the gold standard)

A bucket holds up to B tokens. Tokens regenerate at rate R/sec. Each request consumes one token. If empty, deny.

tokens[user] = min(B, tokens[user] + R * (now - last_refill[user]))
last_refill[user] = now
if tokens[user] >= 1:
    tokens[user] -= 1
    allow
else:
    deny

Pros: Allows controlled bursts up to B. Smooth steady-state. Composable (different buckets per scope).
Cons: Two state values per user. Slightly more complex.

My pick for new systems: token bucket. The ability to allow bursts (humans don’t request at exact rates) makes it feel right to users without breaking the limit on aggregate.

Leaky bucket — adjacent

A queue of fixed length; requests drain at fixed rate. Used in network shaping more than API rate limiting; mentioned for completeness.

Redis implementation (token bucket, atomic)

The naive token-bucket pseudo-code above has a race: two app servers could both read tokens=1, both decrement, both allow, leaving tokens=-1. The fix is a Lua script — Redis executes Lua atomically.

-- KEYS[1]: bucket key
-- ARGV[1]: max tokens (B)
-- ARGV[2]: refill rate (tokens/sec)
-- ARGV[3]: now (unix ms)
-- ARGV[4]: tokens to consume (usually 1)
-- Returns: {allowed (1/0), remaining_tokens, retry_after_ms}

local key       = KEYS[1]
local cap       = tonumber(ARGV[1])
local rate      = tonumber(ARGV[2])
local now       = tonumber(ARGV[3])
local cost      = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "last_ms")
local tokens = tonumber(data[1])
local last_ms = tonumber(data[2])

if tokens == nil then
  tokens = cap
  last_ms = now
end

-- refill
local elapsed_ms = math.max(0, now - last_ms)
tokens = math.min(cap, tokens + (rate * elapsed_ms / 1000))

local allowed = 0
local retry_after = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  -- ms until enough tokens accumulate
  retry_after = math.ceil((cost - tokens) / rate * 1000)
end

redis.call("HMSET", key, "tokens", tokens, "last_ms", now)
redis.call("PEXPIRE", key, math.ceil(cap / rate * 1000) * 2)

return {allowed, math.floor(tokens), retry_after}

Calling it from Python with aioredis:

SCRIPT = open("token_bucket.lua").read()

class RateLimiter:
    def __init__(self, redis, cap, rate):
        self.redis = redis
        self.cap = cap
        self.rate = rate
        self._sha = None

    async def _load(self):
        self._sha = await self.redis.script_load(SCRIPT)

    async def check(self, key: str, cost: int = 1) -> tuple[bool, int, int]:
        if self._sha is None:
            await self._load()
        now_ms = int(time.time() * 1000)
        try:
            allowed, remaining, retry_after = await self.redis.evalsha(
                self._sha, 1, key, self.cap, self.rate, now_ms, cost,
            )
        except aioredis.NoScriptError:
            await self._load()
            return await self.check(key, cost)
        return bool(allowed), remaining, retry_after

Three production details:

PEXPIRE lets idle keys reclaim memory. Without it, memory grows forever.
evalsha + retry on NOSCRIPT survives Redis restarts (the script cache flushes).
Lua-side time would be cleaner, but Redis’s Lua time isn’t available everywhere — pass now_ms from the app.

This implementation does the limit check in one round trip at <1ms typical latency.

HTTP shape

When you deny, the response should be helpful:

HTTP/1.1 429 Too Many Requests
Retry-After: 12
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1717920660
Content-Type: application/json

{"error": "rate_limited", "message": "Try again in 12s"}

Retry-After (in seconds) is the standard header. Clients respect it.
The X-RateLimit-* headers tell well-behaved clients how to pace.
Some APIs also send these headers on successful responses so clients can self-throttle before they hit a 429.

Distributing the limiter

A single Redis is a single point of failure. Three deployment patterns:

1. Single Redis (or Redis cluster)

The simplest. Pin all rate limiter keys to one Redis (or Redis Cluster). For most services, one well-sized Redis can do hundreds of thousands of ops/sec.

2. Sharded by key

Hash key → shard. Each shard is a separate Redis. Linear scale-out. Trivial to operate. Pitfall: a hot key concentrates load on one shard.

3. Local + global hybrid

Each app server has an in-memory limiter that handles 90% of traffic with a relaxed local count. Periodically sync to a global Redis to enforce the global limit and rebalance.

Pros: Massive throughput. Low latency (in-memory).
Cons: Up to a small overcount (each server can let through local_burst extra requests until sync). Acceptable for non-billing limits.

For most APIs, single Redis or sharded Redis is plenty. Reach for hybrid only when you’ve measured an actual problem.

What about Redis going down?

You have three policy choices when the limiter store is unavailable:

Fail open (allow). Best UX, worst safety.
Fail closed (deny). Safe, awful UX during incidents.
Fail static (use a fallback in-memory limiter on each node). The pragmatic middle.

I default to fail static: each app server keeps a coarse local limiter that activates when Redis errors persist for >5s. Logs a warning, paged on duration. Users see degradation, not outages.

Hot-key hazards

What if 10M users all share one key (e.g., ip:1.2.3.4 because they’re behind a corporate NAT)?

Avoid IP-only keys for authenticated traffic. Prefer user_id + IP.
Multi-tier: a coarse global limit per IP, a finer limit per user. The IP one allows for shared NATs.
Sharding within a single key: pick a “sub-shard” by hashing request ID, increment that, check sum periodically.

Fairness

A naive limiter is first-come-first-served — any single user can saturate. Weighted Fair Queuing (WFQ) gives bandwidth proportional to weight. Stochastic Fair Queuing (SFQ) randomizes to avoid starvation.

For most APIs you don’t need WFQ — per-user buckets are fair, in proportion to (limit × number of users). Reach for fairness algorithms when you have a multi-tenant system where one tenant’s bursts shouldn’t degrade another’s.

What’s surprising about deploying this

Time can move backwards during NTP sync. Use monotonic time inside the app, server-supplied wall time only as a tiebreaker.
Network round trips matter — at 1M req/s with a 1ms RTT to Redis, that’s 1000 simultaneous connections worth of pending I/O. Pool aggressively.
TTL handling is where memory leaks live. Always set them.
Test with chaos — kill Redis mid-traffic and see what your service does. The first test is usually surprising.

Where to put policy

The boring but important question. Three options:

In code: a hash map of endpoint → limit. Easy, requires deploy to change.
In a config file: same, but reloadable.
In a control plane: a config service the gateway polls. Slowest to build, fastest to change limits in production.

Build the simplest first. Migrate when “I need to tighten this limit now” becomes a recurring incident response step.

What we’re building#

Where it sits#

The four algorithms#

1. Fixed window counter#

2. Sliding window log#

3. Sliding window counter (the practical one)#

4. Token bucket (the gold standard)#

Leaky bucket — adjacent#

Redis implementation (token bucket, atomic)#

HTTP shape#

Distributing the limiter#

1. Single Redis (or Redis cluster)#

2. Sharded by key#

3. Local + global hybrid#

What about Redis going down?#

Hot-key hazards#

Fairness#

What’s surprising about deploying this#

Where to put policy#

Read this next#