Rate limiting is the boring infrastructure that prevents abuse, controls costs, and enforces fair use. Done right, no one notices. Done wrong, real users get blocked while bots get through. This post is the working design.
Phase 1: in-memory single-node
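import time

class TokenBucket:
    """In-memory token bucket: one (tokens, last_refill) pair per key."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # bucket capacity (maximum burst size)
        self.buckets = {}   # key -> (tokens, last_refill_time)

    def allow(self, key: str) -> bool:
        # Monotonic clock: unaffected by wall-clock adjustments.
        now = time.monotonic()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False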
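With rate=10 and burst=20, each key gets 10 req/sec sustained and bursts of up to 20. Simple. Works for one process; doesn’t scale across replicas, and the dict grows without bound unless idle keys are evicted.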
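To make the burst and refill behaviour concrete, here is a minimal sketch of the same bucket with an injectable clock. The `clock` parameter, `ClockedTokenBucket`, and `FakeClock` are assumptions added for illustration; they are not part of the class above.

```python
import time
from typing import Callable, Dict, Tuple


class ClockedTokenBucket:
    """Token bucket whose time source can be swapped out in tests."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock  # hypothetical hook, added for this sketch
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(key, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        self.buckets[key] = (tokens - 1 if allowed else tokens, now)
        return allowed


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
limiter = ClockedTokenBucket(rate=10, burst=20, clock=clock)

# 25 back-to-back requests: the first 20 drain the bucket, the rest are denied.
burst_results = [limiter.allow("user:1") for _ in range(25)]

# Half a second later the bucket has refilled 0.5 s * 10 tokens/s = 5 tokens.
clock.now += 0.5
refill_results = [limiter.allow("user:1") for _ in range(6)]
```

Injecting the clock keeps tests fast and deterministic; production code would keep the default `time.monotonic`.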
Phase 2: Redis-backed token bucket
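-- token_bucket.lua
-- KEYS[1] = bucket key; ARGV = rate (tokens/sec), burst, now (seconds)
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call('HMGET', key, 'tokens', 'last')
local tokens = tonumber(data[1]) or burst
local last = tonumber(data[2]) or now
-- Refill for elapsed time; clamp at 0 in case a client clock runs behind.
local elapsed = math.max(0, now - last)
tokens = math.min(burst, tokens + elapsed * rate)
local allowed = tokens >= 1
if allowed then tokens = tokens - 1 end
-- HSET accepts multiple field/value pairs since Redis 4.0; HMSET is deprecated.
redis.call('HSET', key, 'tokens', tokens, 'last', now)
-- Expire idle buckets once they would have refilled completely anyway.
redis.call('EXPIRE', key, math.ceil(burst / rate) + 60)
return allowed and 1 or 0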
Redis runs a Lua script atomically: no other command executes between the read and the write. One round trip returns an allow/deny. Works across replicas because state lives in Redis.
# rate=10 tokens/sec, burst=20; the script returns 1 (allow) or 0 (deny)
allowed = bool(await redis.eval(LUA, 1, f"rl:{user_id}", 10, 20, time.time()))
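The call above can be wrapped so that the rest of the service never touches the script directly. This is a sketch under assumptions: `RedisTokenBucket` and `FakeClient` are hypothetical names, and the client is assumed to expose an async `eval(script, numkeys, *keys_and_args)` method the way redis-py's asyncio client does.

```python
import asyncio
import time
from typing import Any, List, Tuple


class RedisTokenBucket:
    """Thin wrapper that runs a token-bucket Lua script through `client.eval`."""

    def __init__(self, client: Any, script: str, rate: float, burst: int,
                 prefix: str = "rl:") -> None:
        self.client = client  # assumed: redis.asyncio.Redis or a compatible fake
        self.script = script
        self.rate = rate
        self.burst = burst
        self.prefix = prefix

    async def allow(self, identity: str) -> bool:
        result = await self.client.eval(
            self.script, 1, f"{self.prefix}{identity}",
            self.rate, self.burst, time.time(),
        )
        return int(result) == 1


class FakeClient:
    """Test double: records calls and replies with a canned script result."""

    def __init__(self, reply: int) -> None:
        self.reply = reply
        self.calls: List[Tuple[Any, ...]] = []

    async def eval(self, script: str, numkeys: int, *keys_and_args: Any) -> int:
        self.calls.append((numkeys, *keys_and_args))
        return self.reply


fake = FakeClient(reply=0)  # pretend the script answered "deny"
limiter = RedisTokenBucket(fake, script="-- token_bucket.lua source goes here",
                           rate=10, burst=20)
denied = asyncio.run(limiter.allow("42"))
```

Because the wrapper only depends on `eval`, handlers can be unit-tested against the fake without a running Redis.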
Phase 3: sliding window
Token bucket allows brief bursts. For a strict “100 requests per minute,” use a sliding window log:
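-- sliding_window_log.lua
-- ARGV = now (seconds), window (seconds), limit, and a unique request id.
-- The request id (ARGV[4]) is an addition: with `now` as the member, two
-- requests carrying the same timestamp collapse into one sorted-set entry.
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local member = ARGV[1] .. ':' .. ARGV[4]
-- Drop entries that have aged out of the window.
redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window)
local count = redis.call('ZCARD', key)
if count >= limit then return 0 end
redis.call('ZADD', key, now, member)
redis.call('EXPIRE', key, math.ceil(window) + 1)
return 1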
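Memory cost: O(N) per key (one entry per admitted request still inside the window, so N is bounded by the limit). Use sparingly.
For an approximation, use a sliding window counter: it combines the current and previous fixed windows, weighting the previous one by how much of it still overlaps the sliding window. O(1) memory, roughly 1% off in accuracy.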
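The weighting can be sketched in a few lines of in-memory Python. This is an illustration under assumptions, not a production limiter: `SlidingWindowCounter`, the injectable `clock`, and `FakeClock` are hypothetical names, and a shared deployment would keep the two counters in Redis instead of a dict.

```python
import time
from typing import Callable, Dict, Tuple


class SlidingWindowCounter:
    """Approximate sliding window built from two fixed-window counters."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        # key -> (window index, count in that window, count in the previous one)
        self.state: Dict[str, Tuple[int, int, int]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        index = int(now // self.window)
        stored_index, current, previous = self.state.get(key, (index, 0, 0))
        if index == stored_index + 1:
            previous, current = current, 0  # rolled into the next window
        elif index != stored_index:
            previous, current = 0, 0        # idle for more than one window
        # Fraction of the current fixed window that has already elapsed.
        elapsed = (now - index * self.window) / self.window
        # Assume the previous window's requests were spread evenly over it.
        estimate = previous * (1 - elapsed) + current
        allowed = estimate < self.limit
        if allowed:
            current += 1
        self.state[key] = (index, current, previous)
        return allowed


class FakeClock:
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
limiter = SlidingWindowCounter(limit=100, window=60, clock=clock)

first_window = [limiter.allow("user:1") for _ in range(101)]

# 15 s into the next window, 75% of the previous window still overlaps:
# estimate = 100 * 0.75 + current, which leaves room for 25 more requests.
clock.now = 75.0
second_window = [limiter.allow("user:1") for _ in range(30)]
```

The error comes from the even-spread assumption: if the previous window's requests were actually clustered at its start or end, the estimate is slightly too strict or too lenient.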
Phase 4: layered limits
async def check_rate(req: Request, user_id: int):
    # Per-IP (DoS protection)
    if not await allow(f"ip:{req.client_ip}", rate=100, burst=200):
        raise TooManyRequests()
    # Per-user
    if not await allow(f"user:{user_id}", rate=10, burst=20):
        raise TooManyRequests()
    # Per-endpoint expensive
    if req.path == "/expensive":
        if not await allow(f"exp:{user_id}", rate=1, burst=5):
            raise TooManyRequests()
Layered limits compose: a request must pass every layer. Free tier ≠ paid tier; set the per-user numbers from your pricing plans.
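One way to make the tier difference explicit is to compute the list of layers per request and let the tier pick the per-user limit. The sketch below is illustrative only: `Limit`, `layers_for`, the tier names, and every number are assumptions, and real values would come from configuration instead of constants.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Limit:
    rate: float  # tokens per second
    burst: int   # bucket capacity


# Placeholder numbers for illustration; real values come from your pricing.
PER_IP = Limit(rate=100, burst=200)
PER_USER: Dict[str, Limit] = {
    "free": Limit(rate=10, burst=20),
    "paid": Limit(rate=50, burst=100),
}
EXPENSIVE = Limit(rate=1, burst=5)


def layers_for(tier: str, user_id: int, client_ip: str,
               path: str) -> List[Tuple[str, Limit]]:
    """Return every (bucket key, limit) pair a request must pass, in order."""
    # Unknown tiers fall back to the strictest plan.
    user_limit = PER_USER.get(tier, PER_USER["free"])
    layers = [(f"ip:{client_ip}", PER_IP), (f"user:{user_id}", user_limit)]
    if path == "/expensive":
        layers.append((f"exp:{user_id}", EXPENSIVE))
    return layers


free_layers = layers_for("free", 7, "203.0.113.9", "/expensive")
paid_layers = layers_for("paid", 8, "203.0.113.9", "/search")
```

A checker would then loop over the returned pairs and reject on the first layer that denies.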
Distributed challenges
Hot keys: one abusive client can drive a single key to 100k QPS, and a single Redis node becomes the bottleneck. Solutions:
- Shard the rate limiter across multiple Redis nodes by hash of key. This spreads load across keys, though one hot key still lands on one node.
- Local first, sync async: each replica enforces locally with a fraction of the budget and periodically reconciles with the shared state.
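The local-first option can be sketched as a bucket that holds only this replica's share of the cluster-wide budget. This is one possible reading of "reconcile", stated as an assumption: the share is recomputed whenever the replica count changes. `LocalShareLimiter`, `rebalance`, and `FakeClock` are hypothetical names, and where the replica count comes from (service discovery, a heartbeat key in Redis) is left open.

```python
import time
from typing import Callable, Dict, Tuple


class LocalShareLimiter:
    """Per-replica token bucket sized as a fraction of a global budget."""

    def __init__(self, global_rate: float, global_burst: float, replicas: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.global_rate = global_rate
        self.global_burst = global_burst
        self.clock = clock
        self.buckets: Dict[str, Tuple[float, float]] = {}
        self.rebalance(replicas)

    def rebalance(self, replicas: int) -> None:
        """Reconcile step: recompute this replica's share of the budget."""
        replicas = max(1, replicas)
        self.rate = self.global_rate / replicas
        self.burst = max(1.0, self.global_burst / replicas)

    def allow(self, key: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        self.buckets[key] = (tokens - 1 if allowed else tokens, now)
        return allowed


class FakeClock:
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
# Global budget: 100 req/s, burst 200, split across 4 replicas.
limiter = LocalShareLimiter(global_rate=100, global_burst=200, replicas=4,
                            clock=clock)
with_four = [limiter.allow("user:hot") for _ in range(60)]  # local burst is 50

# Two replicas disappear; the survivors each take a larger share.
limiter.rebalance(2)
clock.now += 1.0  # one second of refill at the new local rate of 50/s
with_two = [limiter.allow("user:hot") for _ in range(60)]
```

The trade-off is precision: no Redis round trip on the hot path, but if traffic is spread unevenly across replicas, some of the global budget goes unused.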
Clock skew: distributed nodes have slightly different clocks. Use Redis’s own clock (the TIME command, callable from Lua as redis.call('TIME')) instead of client timestamps for consistency.
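TIME replies with two integers, Unix seconds and the microseconds within that second, so the reply needs a small conversion before it can replace `time.time()`. A minimal sketch, where `redis_time_to_seconds` is a hypothetical helper name:

```python
from typing import Tuple


def redis_time_to_seconds(reply: Tuple[int, int]) -> float:
    """Convert a Redis TIME reply (seconds, microseconds) to float seconds."""
    seconds, microseconds = reply
    return int(seconds) + int(microseconds) / 1_000_000


# With redis-py this would be fed from the client's time() call, e.g.
#   now = redis_time_to_seconds(await redis.time())
now = redis_time_to_seconds((1714650000, 250000))
```

Fetching the time from the client costs an extra round trip; reading it inside the Lua script avoids that and keeps the whole decision on one clock.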
Headers and UX
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714650000
Retry-After: 30
Tell the client what just happened and when they can retry. Saves support tickets.
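For a token bucket, these headers can be derived from the bucket state at decision time. The sketch below makes two assumptions worth stating: `rate_limit_headers` is a hypothetical helper, and `X-RateLimit-Reset` is interpreted as the Unix time at which the next request would succeed (services differ on what "reset" means).

```python
import math
from typing import Dict


def rate_limit_headers(limit: int, tokens: float, rate: float,
                       now: float) -> Dict[str, str]:
    """Build rate-limit headers from token-bucket state.

    limit  -- advertised quota
    tokens -- tokens left in the bucket after this decision
    rate   -- refill rate in tokens per second
    now    -- current Unix time in seconds
    """
    remaining = max(0, math.floor(tokens))
    # Whole seconds until at least one token is available again.
    wait = 0 if tokens >= 1 else math.ceil((1 - tokens) / rate)
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(int(now) + wait),
    }
    if wait > 0:
        headers["Retry-After"] = str(wait)  # only meaningful on a 429
    return headers


limited = rate_limit_headers(limit=60, tokens=0.0, rate=0.25, now=1714650000)
healthy = rate_limit_headers(limit=60, tokens=7.9, rate=0.25, now=1714650000)
```

Rounding the wait up means a client that honours `Retry-After` never retries a moment too early.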
Fairness
A per-user token bucket caps request rate, not concurrency: a single user with many connections or slow requests can still starve others.
For multi-tenant systems: enforce limits at the ingress (API gateway) per tenant, then again per process.
Algorithms summary
| Algorithm | Bursts? | Memory | Use |
|---|---|---|---|
| Fixed window | Yes (boundary effect) | O(1) | Simplest; limited fairness |
| Sliding window log | No | O(N) | Strict; expensive |
| Sliding window counter | Approx | O(1) | Common compromise |
| Token bucket | Bursts allowed | O(1) | API limits |
| Leaky bucket | No (smoothed output) | O(1) | Throttling outputs |
Default: token bucket for user APIs; sliding window counter when limits must be close to strict; sliding window log only when they must be exact.
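The boundary effect in the first table row is easiest to see with numbers. In this sketch (`FixedWindowCounter` is a hypothetical name, and the timestamp is passed in explicitly to keep it deterministic), a limit of 100 per minute admits 200 requests within about one second when they straddle a window boundary.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class FixedWindowCounter:
    """Fixed-window limiter: one counter per (key, window index)."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        # Old windows are never evicted here; a real limiter would expire them.
        self.counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        bucket = (key, int(now // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


limiter = FixedWindowCounter(limit=100, window=60)

# 100 requests in the last second of window 0 ...
end_of_window = [limiter.allow("user:1", now=59.0) for _ in range(100)]
# ... and 100 more in the first second of window 1.
start_of_next = [limiter.allow("user:1", now=60.0) for _ in range(100)]

admitted_in_one_second = sum(end_of_window) + sum(start_of_next)
```

Both batches pass because each lands in a fresh counter, which is the gap the sliding window variants close.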
Tools
- nginx limit_req: built-in module (ngx_http_limit_req_module); leaky-bucket limiting at a fixed rate, with an optional burst.
- Cloudflare / Fastly: edge rate limiting.
- Envoy / Linkerd: sidecar rate limiting.
- Redis + Lua: DIY distributed.
- API gateways (Kong, Tyk): managed.
For most: Redis + Lua behind your service, with edge protection at the LB.
Common mistakes
1. Per-IP only
Behind NAT, one office shares an IP. All blocked. Pair with per-user limits.
2. No 429 response
Silently dropping requests makes clients retry harder. Always return 429 + Retry-After.
3. Hard-coded limits
Limits live in code; ops can’t change without deploy. Externalize to config or DB.
4. Limits per process not per cluster
10 replicas each with 10 req/s = 100 req/s effective. Use a shared store (Redis).
5. No emergency override
Legit traffic spike misclassified as abuse. Need a knob to raise / lift limits in seconds.
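Mistakes 3 and 5 can be addressed together by loading limits from a document that ops can edit at runtime and giving it a global multiplier as the emergency knob. This is a sketch under assumptions: the JSON shape, the `multiplier` field, and `load_limits` are all hypothetical, and where the document lives (a config service, a Redis key, a database row) is left open.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Limit:
    rate: float  # tokens per second
    burst: int   # bucket capacity


def load_limits(raw: str) -> Dict[str, Limit]:
    """Parse limits from JSON; `multiplier` scales every limit at once."""
    doc = json.loads(raw)
    multiplier = float(doc.get("multiplier", 1.0))
    return {
        name: Limit(rate=spec["rate"] * multiplier,
                    burst=int(spec["burst"] * multiplier))
        for name, spec in doc["limits"].items()
    }


# Hypothetical config document; the numbers mirror the layered example.
CONFIG = """
{
  "multiplier": 1,
  "limits": {
    "ip":   {"rate": 100, "burst": 200},
    "user": {"rate": 10,  "burst": 20}
  }
}
"""

normal = load_limits(CONFIG)
# During a legitimate spike, ops change one field instead of shipping a deploy.
emergency = load_limits(CONFIG.replace('"multiplier": 1', '"multiplier": 3'))
```

Re-reading the document on a short interval bounds how long a change takes to apply; a single multiplier is coarser than per-limit overrides but faster to reason about during an incident.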
What I’d ship today
For a typical SaaS:
- Edge (Cloudflare / nginx): basic per-IP limits, DoS protection.
- API gateway (Kong / custom): per-user / per-API-key limits.
- App-level: per-endpoint expensive operations.
- Redis + Lua for the dynamic shared state.
- Headers on every response.
- 429 + Retry-After when limited.
- Dashboards for limit hit rates per tier.
Read this next
- Rate Limiting Patterns 2026
- Caching Strategies 2026
- API Gateway Patterns 2026
- Design a Distributed Counter
If you want my Redis Lua + Python rate limiter library, it’s at rajpoot.dev.