Rate limiting is the boring infrastructure that prevents abuse, controls costs, and enforces fair use. Done right, no one notices. Done wrong, real users get blocked while bots get through. This post is the working design.

Phase 1: in-memory single-node

import time

class TokenBucket:
    """Per-key token bucket: refills at `rate` tokens/sec, holds at most `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.buckets = {}  # key -> (tokens, time of last update)

    def allow(self, key: str) -> bool:
        # monotonic() never jumps backwards when the wall clock is adjusted
        now = time.monotonic()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self.buckets[key] = (tokens, now)
        return allowed

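Instantiate it as TokenBucket(rate=10, burst=20) for 10 req/sec with bursts of up to 20. Simple. It works for one process but doesn’t scale across replicas, and the buckets dict grows without bound unless idle keys are evicted.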
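To see the burst-then-steady behavior without waiting on a real clock, here is a minimal sketch of the same bucket with an injectable clock. The `clock` parameter and the `FakeClock` helper are assumptions added for the demo; they are not part of the class above.

```python
import time
from typing import Callable


class TokenBucket:
    """Same algorithm as above; `clock` is injectable so behavior is testable."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # key -> (tokens, time of last update)

    def allow(self, key: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self.buckets[key] = (tokens, now)
        return allowed


class FakeClock:
    """Hypothetical test helper: a clock that only moves when told to."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
limiter = TokenBucket(rate=10, burst=20, clock=clock)

# A fresh key can spend the whole burst at once, and nothing more.
burst_results = [limiter.allow("alice") for _ in range(25)]
print(sum(burst_results))  # 20

# The bucket refills at 10 tokens/sec: after 0.5 s, 5 more requests pass.
clock.now += 0.5
refill_results = [limiter.allow("alice") for _ in range(10)]
print(sum(refill_results))  # 5
```

Each key gets its own bucket, so a different key such as "bob" still starts with a full burst.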

Phase 2: Redis-backed token bucket

-- token_bucket.lua
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last')
local tokens = tonumber(data[1]) or burst
local last = tonumber(data[2]) or now

tokens = math.min(burst, tokens + (now - last) * rate)
local allowed = tokens >= 1
if allowed then tokens = tokens - 1 end

-- HSET accepts multiple field/value pairs since Redis 4.0; HMSET is deprecated
redis.call('HSET', key, 'tokens', tokens, 'last', now)
redis.call('EXPIRE', key, math.ceil(burst / rate) + 60)

return allowed and 1 or 0

Lua script = atomic. Run once, get an allow/deny. Works across replicas because state lives in Redis.

allowed = await redis.eval(LUA, 1, f"rl:{user_id}", 10, 20, time.time())
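The one-liner above can be wrapped in a small helper, which is roughly the shape the `allow(...)` calls in Phase 4 assume. This is a sketch under stated assumptions: `allow` and `FakeRedis` are illustrative names, and `client` is taken to be any object with an async `eval(script, numkeys, *keys_and_args)` method, which matches `redis.asyncio.Redis.eval` in redis-py. The fake client only replays canned replies so the example runs without a server; the script string is the Phase 2 script with HSET in place of the deprecated HMSET.

```python
import asyncio
import time

# Token bucket script from Phase 2, held as a string.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last')
local tokens = tonumber(data[1]) or burst
local last = tonumber(data[2]) or now

tokens = math.min(burst, tokens + (now - last) * rate)
local allowed = tokens >= 1
if allowed then tokens = tokens - 1 end

redis.call('HSET', key, 'tokens', tokens, 'last', now)
redis.call('EXPIRE', key, math.ceil(burst / rate) + 60)

return allowed and 1 or 0
"""


async def allow(client, key: str, rate: float, burst: int) -> bool:
    """Return True if the request identified by `key` is within its limit."""
    reply = await client.eval(
        TOKEN_BUCKET_LUA, 1, f"rl:{key}", rate, burst, time.time()
    )
    return reply == 1


class FakeRedis:
    """Stand-in for a Redis client: replays canned replies, records calls."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.calls = []

    async def eval(self, script, numkeys, *keys_and_args):
        self.calls.append((numkeys, keys_and_args))
        return self.replies.pop(0)


async def demo():
    client = FakeRedis([1, 0])  # first call allowed, second denied
    first = await allow(client, "user:42", rate=10, burst=20)
    second = await allow(client, "user:42", rate=10, burst=20)
    return first, second, client.calls


first, second, calls = asyncio.run(demo())
print(first, second)  # True False
```

With a real client, redis-py's `register_script` is worth considering instead of `eval`, since it sends the script body once and then invokes it by hash.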

Phase 3: sliding window

Token bucket allows brief bursts. For a strict “100 requests per minute,” use a sliding window log:

-- sliding_window_log.lua
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window)
local count = redis.call('ZCARD', key)
if count >= limit then return 0 end

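-- The member must be unique per request: ZADD key now now would collapse two
-- requests with the same timestamp into one entry and undercount.
-- ARGV[4] is a caller-supplied unique request id (for example a UUID).
redis.call('ZADD', key, now, ARGV[4])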
redis.call('EXPIRE', key, window + 1)
return 1

Memory cost: O(N) per key (one entry per recent request). Use sparingly.

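For an approximation, use a sliding window counter: it combines the current and previous fixed windows, weighting the previous one by how much of it still overlaps the sliding window. O(1) memory; the estimate is typically off by about 1% because it assumes the previous window’s requests were evenly spread.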
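The weighting can be sketched in a few lines. This is an in-memory illustration only, not the Redis version: the class name, the injectable `clock`, and the per-key state tuple are assumptions made for the example.

```python
import time
from typing import Callable


class SlidingWindowCounter:
    """Approximate sliding window: two fixed-window counters per key."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        # key -> (index of current window, count in it, count in previous one)
        self.state = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        idx = int(now // self.window)
        cur_idx, cur, prev = self.state.get(key, (idx, 0, 0))
        if idx == cur_idx + 1:
            prev, cur = cur, 0   # rolled over into the next window
        elif idx > cur_idx + 1:
            prev, cur = 0, 0     # idle for a whole window or more
        # Fraction of the previous window still covered by the sliding window
        overlap = 1.0 - (now % self.window) / self.window
        estimated = prev * overlap + cur
        allowed = estimated < self.limit
        if allowed:
            cur += 1
        self.state[key] = (idx, cur, prev)
        return allowed


now = [30.0]  # mutable cell so the lambda below sees updates
limiter = SlidingWindowCounter(limit=100, window=60, clock=lambda: now[0])

first = sum(limiter.allow("k") for _ in range(120))
print(first)   # 100

# 15 s into the next window, 75% of the previous window still overlaps:
# estimate = 100 * 0.75 + current, so 25 more requests fit.
now[0] = 75.0
second = sum(limiter.allow("k") for _ in range(120))
print(second)  # 25
```

This run also shows where the error comes from. All 100 earlier requests arrived at t=30, so an exact sliding window log would allow none at t=75; the counter assumes they were evenly spread and lets 25 through.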

Phase 4: layered limits

async def check_rate(req: Request, user_id: int):
    # Per-IP (DoS protection)
    if not await allow(f"ip:{req.client_ip}", rate=100, burst=200):
        raise TooManyRequests()
    
    # Per-user
    if not await allow(f"user:{user_id}", rate=10, burst=20):
        raise TooManyRequests()
    
    # Per-endpoint expensive
    if req.path == "/expensive":
        if not await allow(f"exp:{user_id}", rate=1, burst=5):
            raise TooManyRequests()

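Layered limits compose: a request has to pass every layer, and the tightest one wins. The numbers should also differ by plan (free tier ≠ paid tier), so derive them from your pricing instead of using one constant for everyone.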
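One way to make the per-user layer plan-aware is a small lookup table keyed by plan. This is a sketch under stated assumptions: the plan names, the numbers, and the `limits_for` helper are all hypothetical, and in practice the table would be loaded from config or a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    rate: float  # sustained requests per second
    burst: int   # maximum bucket size


# Hypothetical plans and numbers, for illustration only.
PLAN_LIMITS = {
    "free": Limit(rate=10, burst=20),
    "pro": Limit(rate=50, burst=100),
}
DEFAULT_PLAN = "free"


def limits_for(plan: str) -> Limit:
    """Look up a plan's limits; unknown plans get the default plan's limits."""
    return PLAN_LIMITS.get(plan, PLAN_LIMITS[DEFAULT_PLAN])


limit = limits_for("pro")
print(limit.rate, limit.burst)  # 50 100
```

The per-user check then becomes `allow(f"user:{user_id}", rate=limit.rate, burst=limit.burst)`.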

Distributed challenges

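Hot keys: one abusive client can drive a single key to 100k QPS, and a single Redis node becomes the bottleneck. Solutions:

  • Shard the rate limiter across multiple Redis nodes by hash of key. This spreads total load, though any one hot key still lands on a single node.
  • Local first, sync async: each replica enforces locally with a fraction of the budget and periodically reconciles with the shared store. This is what takes the pressure off a single hot key.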
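Both ideas reduce to a little arithmetic. A minimal sketch, assuming modulo sharding and an even split of the budget; `shard_for` and `local_budget` are hypothetical helper names, and the reconciliation step is left out.

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a rate-limit key to a shard index with a stable hash.

    Python's built-in hash() is randomized per process for strings, so
    replicas need a deterministic hash such as CRC32 to agree on a shard.
    """
    return zlib.crc32(key.encode("utf-8")) % num_shards


def local_budget(global_rate: float, num_replicas: int) -> float:
    """Naive even split of one key's global rate across replicas."""
    return global_rate / num_replicas


# Distribution of 1000 distinct keys over 4 shards
counts = [0] * 4
for i in range(1000):
    counts[shard_for(f"user:{i}", 4)] += 1
print(counts)

# A 100 req/s global limit enforced locally by 10 replicas
print(local_budget(100, 10))  # 10.0
```

Modulo sharding remaps most keys when the shard count changes. For a rate limiter that usually just means some buckets reset, but consistent hashing is an option if that matters.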

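Clock skew: distributed nodes have slightly different clocks, so a client-supplied timestamp (as in the scripts above) is only as consistent as your hosts’ clocks. Use Redis’s own clock instead: the TIME command, which a Lua script can call as redis.call('TIME'). On Redis versions before 5, call redis.replicate_commands() first so the script may write after reading the time.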
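From Python, the same clock can be read through the client. A small sketch, assuming a client whose async `time()` method returns the TIME reply as a `(seconds, microseconds)` pair, as redis-py's does; `redis_now` and `FakeRedis` are illustrative names, and the fake returns a fixed value so the example runs without a server.

```python
import asyncio


def server_time_to_seconds(reply) -> float:
    """Convert a TIME reply (seconds, microseconds) to float seconds."""
    seconds, micros = reply
    return int(seconds) + int(micros) / 1_000_000


async def redis_now(client) -> float:
    """Read the clock from Redis so every replica shares one time source."""
    return server_time_to_seconds(await client.time())


class FakeRedis:
    """Stand-in client that returns a fixed TIME reply."""

    async def time(self):
        return (1714650000, 250000)


now = asyncio.run(redis_now(FakeRedis()))
print(now)  # 1714650000.25
```

This costs an extra round trip per check. Reading TIME inside the Lua script avoids that and is usually the better place for it.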

Headers and UX

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714650000
Retry-After: 30

Tell the client what just happened and when they can retry. Saves support tickets.
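For a token bucket, these values can be derived from the bucket state. A minimal sketch: `rate_limit_headers` is a hypothetical helper, and because the X-RateLimit-* headers are a convention rather than a standard, the meaning chosen here for Reset (the time the bucket is full again) is an assumption.

```python
import math


def rate_limit_headers(tokens: float, rate: float, burst: int, now: float) -> dict:
    """Build response headers from token-bucket state.

    tokens: tokens left after this request was evaluated
    rate:   refill rate in tokens per second
    burst:  bucket capacity, reported as the limit
    now:    current Unix time in seconds
    """
    headers = {
        "X-RateLimit-Limit": str(burst),
        "X-RateLimit-Remaining": str(max(0, math.floor(tokens))),
        # Unix time at which the bucket will be full again
        "X-RateLimit-Reset": str(math.ceil(now + (burst - tokens) / rate)),
    }
    if tokens < 1:
        # Whole seconds until one full token is available
        headers["Retry-After"] = str(max(1, math.ceil((1 - tokens) / rate)))
    return headers


limited = rate_limit_headers(tokens=0.5, rate=0.25, burst=60, now=1714650000)
print(limited["Retry-After"])        # 2
print(limited["X-RateLimit-Reset"])  # 1714650238
```

Retry-After is only set when the request was actually limited; the other three can go on every response.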

Fairness

A per-user token bucket caps request rate, not concurrency, so a single user holding many slow connections can still starve others.

For multi-tenant systems: enforce limits at the ingress (API gateway) per tenant, then again per process.

Algorithms summary

| Algorithm | Bursts? | Memory | Use |
| --- | --- | --- | --- |
| Fixed window | Yes (boundary effect) | O(1) | Simplest; limited fairness |
| Sliding window log | No | O(N) | Strict; expensive |
| Sliding window counter | Approx | O(1) | Common compromise |
| Token bucket | Bursts allowed | O(1) | API limits |
| Leaky bucket | No (smoothed output) | O(1) | Throttling outputs |

Default: token bucket for user APIs; sliding window counter when you need a near-strict limit at O(1) memory; sliding window log only when the limit must be exact.
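The boundary effect of a fixed window is easiest to see with numbers. A minimal in-memory sketch; the `FixedWindow` class is illustrative, takes the time as an argument to keep the demo deterministic, and never evicts old counters.

```python
class FixedWindow:
    """Fixed window counter: at most `limit` requests per aligned window."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = {}  # (key, window index) -> request count

    def allow(self, key: str, now: float) -> bool:
        slot = (key, int(now // self.window))
        if self.counts.get(slot, 0) >= self.limit:
            return False
        self.counts[slot] = self.counts.get(slot, 0) + 1
        return True


limiter = FixedWindow(limit=100, window=60)

# 100 requests just before the boundary, 100 more just after it
late = sum(limiter.allow("k", now=59.5) for _ in range(100))
early = sum(limiter.allow("k", now=60.5) for _ in range(100))
print(late + early)  # 200
```

A "100 per minute" limit accepted 200 requests within about one second, which is the case the sliding window variants are meant to close.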

Tools

  • nginx limit_req: built-in module; leaky-bucket based, with an optional burst allowance.
  • Cloudflare / Fastly: edge rate limiting.
  • Envoy / Linkerd: sidecar rate limiting.
  • Redis + Lua: DIY distributed.
  • API gateways (Kong, Tyk): managed.

For most: Redis + Lua behind your service, with edge protection at the LB.

Common mistakes

1. Per-IP only

Behind NAT, one office shares an IP. All blocked. Pair with per-user limits.

2. No 429 response

Dropping the request silently makes clients retry harder. Always return 429 with Retry-After.

3. Hard-coded limits

Limits live in code; ops can’t change without deploy. Externalize to config or DB.

4. Limits per process not per cluster

10 replicas each with 10 req/s = 100 req/s effective. Use a shared store (Redis).

5. No emergency override

Legit traffic spike misclassified as abuse. Need a knob to raise / lift limits in seconds.

What I’d ship today

For a typical SaaS:

  • Edge (Cloudflare / nginx): basic per-IP limits, DoS protection.
  • API gateway (Kong / custom): per-user / per-API-key limits.
  • App-level: per-endpoint expensive operations.
  • Redis + Lua for the dynamic shared state.
  • Headers on every response.
  • 429 + Retry-After when limited.
  • Dashboards for limit hit rates per tier.

Read this next

If you want my Redis Lua + Python rate limiter library, it’s at rajpoot.dev .

