Rate limiting is one of those features that goes unappreciated until you don’t have it, and then your service goes down because someone wrote a for url in urls: requests.get(url) loop with a million URLs.
This post covers the four classic rate limiting algorithms, when to use each, how to implement them in Redis, and what HTTP headers your API should send back so clients know what’s happening.
Why rate limit at all?
Three different reasons, with different implications:
- Protect the service from overload. A bot sending 10,000 requests/sec will OOM your DB. Rate limit to keep the system upright.
- Fair usage across customers. One greedy customer shouldn’t degrade everyone else.
- Pricing tiers. Free tier = 100 req/min, Pro = 10,000 req/min. Rate limiting is how you enforce the plan.
These overlap, but they’re not the same. Tier enforcement happens per API key. Overload protection might happen per IP or per endpoint regardless of who you are.
The four algorithms
1. Fixed window
Count requests within fixed time buckets:
Limit: 100 requests per minute.
12:00:00–12:00:59 → counter resets at 12:01:00
→ user makes 100 requests at 12:00:30
→ 101st request: blocked
Pros: simplest to implement; one counter per window. Cons: burst at the boundary. If you allow 100/min, a user can do 100 at 12:00:59 and another 100 at 12:01:00 = 200 in 2 seconds.
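import time

def is_allowed(key: str, limit: int = 100, window: int = 60) -> bool:
    # `redis` is assumed to be an already-connected redis-py client.
    bucket = f"rl:{key}:{int(time.time()) // window}"  # one counter per window
    count = redis.incr(bucket)
    if count == 1:
        # First hit in this window: let the key expire on its own.
        redis.expire(bucket, window)
    return count <= limit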
One Redis operation per request, plus an EXPIRE on the first request of each window. Cheap. Fine for most cases.
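To see the boundary burst concretely, here is a minimal in-memory sketch of the same counter. It assumes an explicit `now` argument instead of the wall clock, and the `FixedWindowLimiter` class is a hypothetical stand-in for the Redis version, not a replacement for it.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory stand-in for the Redis counter, driven by an explicit clock."""

    def __init__(self, limit: int = 100, window: int = 60) -> None:
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def is_allowed(self, key: str, now: float) -> bool:
        bucket = (key, int(now) // self.window)
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit


limiter = FixedWindowLimiter(limit=100, window=60)
# 100 requests in the last second of one window, 100 in the first second of the next.
late = sum(limiter.is_allowed("user-1", now=59.0) for _ in range(100))
early = sum(limiter.is_allowed("user-1", now=60.0) for _ in range(100))
print(late + early)  # 200 accepted, even though the limit is 100/min
```

Both batches land in different buckets, so neither ever sees the other's count.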
2. Sliding window log
Store the timestamp of every request; count how many fall within the last N seconds.
Pros: exact, no burst at boundaries. Cons: memory grows with traffic — every request adds an entry.
import time
import uuid

def is_allowed(key: str, limit: int = 100, window: int = 60) -> bool:
    now = time.time()
    bucket = f"rl:log:{key}"
    pipe = redis.pipeline()
    pipe.zremrangebyscore(bucket, 0, now - window)  # drop old entries
    pipe.zadd(bucket, {str(uuid.uuid4()): now})     # add this request
    pipe.zcard(bucket)                              # count entries, including this one
    pipe.expire(bucket, window)
    _, _, count, _ = pipe.execute()
    # Denied requests are logged too, so a client that keeps retrying stays blocked.
    return count <= limit
Memory: O(traffic × window). For very high throughput, prefer the next pattern.
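The same idea can be sketched in memory to show that the boundary burst disappears. This is an illustrative variant, not the Redis code above: the `SlidingLogLimiter` class and the explicit `now` argument are assumptions, and unlike the Redis version it only records requests it accepts.

```python
from collections import defaultdict, deque


class SlidingLogLimiter:
    """Keeps one timestamp per accepted request, per key."""

    def __init__(self, limit: int = 100, window: int = 60) -> None:
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def is_allowed(self, key: str, now: float) -> bool:
        log = self.logs[key]
        # Drop timestamps that have fallen out of the trailing window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


limiter = SlidingLogLimiter(limit=100, window=60)
late = sum(limiter.is_allowed("user-1", now=59.0) for _ in range(100))
early = sum(limiter.is_allowed("user-1", now=60.0) for _ in range(100))
print(late, early)  # 100 0: the second burst is rejected in full
```

The cost is visible in `self.logs`: one entry per accepted request, which is the memory growth described above.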
3. Sliding window counter (the practical winner)
Approximate the sliding window using two fixed-window counters:
- Count requests in the current window.
- Count requests in the previous window, weighted by how much of it overlaps the sliding window.
import time

def is_allowed(key: str, limit: int = 100, window: int = 60) -> bool:
    now = time.time()
    current_window = int(now) // window
    previous_window = current_window - 1
    elapsed_in_current = (now % window) / window  # 0.0 to 1.0

    pipe = redis.pipeline()
    pipe.get(f"rl:{key}:{current_window}")
    pipe.get(f"rl:{key}:{previous_window}")
    cur_raw, prev_raw = pipe.execute()
    cur = int(cur_raw or 0)
    prev = int(prev_raw or 0)

    # Weight the previous window by the share of it still inside the sliding window.
    estimated = prev * (1 - elapsed_in_current) + cur
    if estimated >= limit:
        return False

    # The check and the increment are separate round trips, so concurrent
    # requests can overshoot slightly; use a Lua script if that matters.
    pipe = redis.pipeline()
    pipe.incr(f"rl:{key}:{current_window}")
    pipe.expire(f"rl:{key}:{current_window}", window * 2)
    pipe.execute()
    return True
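Pros: O(1) memory; smooths out the boundary burst. Cons: approximate, because it assumes the previous window’s requests were spread evenly across that window. Almost always good enough.
Cloudflare has written about using this approach at scale, and it is a common choice for large APIs.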
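The weighting is easier to follow with numbers. The sketch below pulls the estimate out into a standalone helper (the `estimate` function and the sample counts are illustrative assumptions) and evaluates it at two points in a 60-second window.

```python
def estimate(prev: int, cur: int, elapsed_fraction: float) -> float:
    """Weighted request count for the trailing window.

    elapsed_fraction is how far we are into the current window (0.0 to 1.0),
    so (1 - elapsed_fraction) of the previous window still overlaps.
    """
    return prev * (1 - elapsed_fraction) + cur


# 15 seconds into the window: 75% of the previous window still counts.
print(estimate(prev=80, cur=30, elapsed_fraction=0.25))  # 90.0

# Right at the boundary, a full previous window still counts in full,
# so the fixed-window "fresh start" burst is not available.
print(estimate(prev=100, cur=0, elapsed_fraction=0.0))  # 100.0
```

With a limit of 100, the first case is allowed and the second is rejected.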
4. Token bucket
A bucket holds N tokens. Each request consumes 1 token. Tokens refill at a steady rate (R tokens/sec). Rejected when empty.
import time

def is_allowed(key: str, capacity: int = 100, refill_rate: float = 1.6) -> bool:
    """capacity = max burst; refill_rate = sustained req/sec."""
    now = time.time()
    bucket = f"rl:tb:{key}"
    # HMGET returns values only, so this works whether or not the client decodes responses.
    raw_tokens, raw_last = redis.hmget(bucket, "tokens", "last")
    tokens = float(raw_tokens) if raw_tokens is not None else float(capacity)
    last = float(raw_last) if raw_last is not None else now

    # Refill based on elapsed time
    tokens = min(capacity, tokens + (now - last) * refill_rate)
    allowed = tokens >= 1
    if allowed:
        tokens -= 1
    # This read-modify-write is not atomic; wrap it in a Lua script for strict limits.
    redis.hset(bucket, mapping={"tokens": tokens, "last": now})
    redis.expire(bucket, 3600)
    return allowed
Pros: allows controlled bursts (full bucket = burst of capacity); fine-grained control over sustained rate.
Cons: more state to track per key.
AWS API Gateway, for example, documents its throttling as a token bucket, and the same model is common across cloud providers. It’s also the algorithm of choice when you want to allow legitimate bursts.
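Here is a minimal in-memory sketch of the burst-then-sustain behavior. It assumes an explicit `now` argument, and both the `TokenBucket` class and the numbers (capacity 100, refill 10/sec) are illustrative, not taken from any provider.

```python
class TokenBucket:
    """Single-key token bucket driven by an explicit clock."""

    def __init__(self, capacity: float, refill_rate: float, now: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last = now

    def is_allowed(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True


bucket = TokenBucket(capacity=100, refill_rate=10, now=0.0)
burst = sum(bucket.is_allowed(now=0.0) for _ in range(150))
print(burst)  # 100: the full bucket is spent in one burst

after_one_second = sum(bucket.is_allowed(now=1.0) for _ in range(150))
print(after_one_second)  # 10: one second of refill at 10 tokens/sec
```

Capacity sets the size of the burst; the refill rate sets what a client can sustain afterwards.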
Bonus: leaky bucket
Conceptual cousin of the token bucket — a fixed-rate “drain” of a queue. Useful for smoothing bursty input rather than allowing bursts. Less common in API rate limiting; more common in network shaping.
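A rough sketch of the queue-style leaky bucket, to contrast with the token bucket: instead of answering allow/deny immediately, it tells each request when it may proceed. The `LeakyBucket` class, its `offer` method, and the numbers are assumptions for illustration.

```python
from typing import Optional


class LeakyBucket:
    """Releases requests at a fixed rate; rejects when the backlog is too deep."""

    def __init__(self, capacity: int, leak_rate: float) -> None:
        self.capacity = capacity         # how many requests the schedule may run ahead
        self.interval = 1.0 / leak_rate  # seconds between released requests
        self.next_slot = 0.0             # earliest time the next request may leave

    def offer(self, now: float) -> Optional[float]:
        """Return the release time for this request, or None if the bucket is full."""
        start = max(now, self.next_slot)
        backlog = (start - now) / self.interval  # requests already scheduled ahead
        if backlog >= self.capacity:
            return None
        self.next_slot = start + self.interval
        return start


bucket = LeakyBucket(capacity=3, leak_rate=2)  # drains 2 requests/sec
release_times = [bucket.offer(now=0.0) for _ in range(5)]
print(release_times)  # [0.0, 0.5, 1.0, None, None]
```

Five requests arrive at once; three are spaced out at the drain rate and two overflow. A token bucket with the same capacity would have let the first three through immediately.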
Choosing the right algorithm
| Algorithm | Memory | Accuracy | Burst behavior | Implementation |
|---|---|---|---|---|
| Fixed window | O(1) | Loose at boundaries | Doubles at boundary | Simplest |
| Sliding window log | O(traffic) | Perfect | None | Easy but memory-heavy |
| Sliding window counter | O(1) | ~99% | Smoothed | Best default |
| Token bucket | O(1) per key | Perfect | Allowed up to bucket size | Most flexible |
For most APIs, sliding window counter is the right default. Use token bucket when you want to allow controlled bursts (e.g. SDK clients that batch).
What to limit on
- Per API key: the most common; how you enforce pricing tiers.
- Per user / account: when authenticated users share API keys with their own apps.
- Per IP: for unauthenticated endpoints (login attempts, public APIs).
- Per endpoint: different limits for different operations. /auth/login should be much stricter than /articles.
- Multiple dimensions at once: enforce all of them; the strictest wins.
For login endpoints specifically, also rate limit on the target, i.e. per (username, IP) pair, to slow down credential stuffing.
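One way to sketch "enforce all of them; the strictest wins" is to build a list of (counter key, limit) pairs and require every one to pass. The key formats, the limit numbers, and the helper names below are all hypothetical; `is_allowed` stands for whichever limiter you picked above.

```python
from typing import Callable, Optional

# Hypothetical per-endpoint limits (requests per minute); tune for your API.
ENDPOINT_LIMITS = {"/auth/login": 5, "/articles": 300}
DEFAULT_ENDPOINT_LIMIT = 100


def limit_checks(api_key: Optional[str], ip: str, endpoint: str) -> list[tuple[str, int]]:
    """Return (counter key, limit) pairs for every dimension that applies."""
    checks = [(f"ip:{ip}:{endpoint}", ENDPOINT_LIMITS.get(endpoint, DEFAULT_ENDPOINT_LIMIT))]
    if api_key is not None:
        checks.append((f"key:{api_key}", 100))
    return checks


def allowed_on_all(checks: list[tuple[str, int]],
                   is_allowed: Callable[[str, int], bool]) -> bool:
    # Evaluate every check (no short-circuit) so each counter records the request.
    results = [is_allowed(key, limit) for key, limit in checks]
    return all(results)


# Tiny in-memory counter standing in for one of the Redis limiters.
counts: dict[str, int] = {}


def count_and_check(key: str, limit: int) -> bool:
    counts[key] = counts.get(key, 0) + 1
    return counts[key] <= limit


checks = limit_checks(api_key="abc", ip="203.0.113.7", endpoint="/auth/login")
outcomes = [allowed_on_all(checks, count_and_check) for _ in range(6)]
print(outcomes)  # five True, then False once the /auth/login limit of 5 is hit
```

The API key still has 94 requests left at that point, but the stricter per-endpoint limit decides.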
HTTP headers: tell clients what’s going on
Pick one of the de facto standards:
Draft RFC headers (RateLimit-*)
RateLimit-Limit: 100
RateLimit-Remaining: 23
RateLimit-Reset: 47 # seconds until reset
GitHub-style headers (X-RateLimit-*)
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 23
X-RateLimit-Reset: 1714378200 # Unix timestamp
When you actually deny a request:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 30
{
"error": "rate_limited",
"message": "Too many requests. Retry in 30 seconds."
}
Retry-After (in seconds, or an HTTP date) is widely understood by clients and SDKs. Always include it.
Pick one set of headers for your API and stick with it. Don’t mix.
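A small helper keeps the headers consistent across endpoints. This is a sketch using the RateLimit-* names shown above; the `rate_limit_headers` function and its parameters are assumptions, and you would attach the returned dict to whatever response object your framework uses.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_in: float,
                       allowed: bool) -> dict[str, str]:
    """Build response headers from the limiter's state.

    reset_in is the number of seconds until the client gets capacity back.
    """
    seconds = max(0, math.ceil(reset_in))  # round up so clients never retry early
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(seconds),
    }
    if not allowed:
        # Only sent alongside a 429.
        headers["Retry-After"] = str(seconds)
    return headers


denied = rate_limit_headers(limit=100, remaining=0, reset_in=29.2, allowed=False)
print(denied["Retry-After"])  # 30
```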
Where to enforce
Three layers, each useful for different things:
1. The CDN / edge
Cloudflare, Fastly, AWS WAF — they can rate limit before traffic reaches you. Best for blocking obviously abusive traffic and DDoS-style attacks. Cheap and absorbs the heaviest load.
2. The API gateway / reverse proxy
Nginx, Envoy, Kong, Traefik — rate limit at the proxy. Works without hitting your application code. Good for per-IP, per-endpoint, and global limits.
Nginx example:
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://app;
    }
}
3. The application
For per-user, per-API-key, per-tenant limits — anything that requires authentication context. Implement with Redis as shown above, or use a library:
- Python: slowapi, Flask-Limiter, django-ratelimit.
- Go: uber-go/ratelimit, or roll your own with redis_rate.
For most APIs, do all three: edge for DDoS, proxy for per-IP, app for per-user/key.
Distributed systems concerns
If you have multiple app servers, your rate limiting state must be shared; that’s where Redis (or another central store) comes in. Don’t use in-process counters across a fleet: with requests spread over N app servers, a user can get up to N× their actual limit.
For very high throughput where Redis itself becomes a bottleneck:
- Approximate counters (a count-min sketch for per-key request counts, HyperLogLog for counting distinct clients): exchange a tiny accuracy hit for huge memory savings.
- Local-first with periodic sync: count locally per-process, sync deltas to Redis every few seconds. Approximate, but cheap.
- Distributed rate limiting services and libraries: Envoy’s rate limit service, YouTube’s Doorman, etc.
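The local-first pattern can be sketched as below. Everything here is an assumption for illustration: the `LocalFirstCounter` class, the `push` callable (in production it might wrap a Redis INCRBY and return the new total), and the absence of window resets, which a real version would need.

```python
from collections import defaultdict
from typing import Callable


class LocalFirstCounter:
    """Counts in-process and syncs deltas to a shared store on flush()."""

    def __init__(self, push: Callable[[str, int], int], limit: int) -> None:
        self.push = push    # push(key, delta) -> new global total
        self.limit = limit
        self.pending = defaultdict(int)      # counted locally, not yet synced
        self.global_seen = defaultdict(int)  # last total reported by the store

    def is_allowed(self, key: str) -> bool:
        if self.global_seen[key] + self.pending[key] >= self.limit:
            return False
        self.pending[key] += 1
        return True

    def flush(self) -> None:
        # Call this on a timer, e.g. every few seconds.
        for key, delta in list(self.pending.items()):
            self.global_seen[key] = self.push(key, delta)
            self.pending[key] = 0


# In-memory stand-in for the shared store.
store = defaultdict(int)


def push(key: str, delta: int) -> int:
    store[key] += delta
    return store[key]


a = LocalFirstCounter(push, limit=10)
b = LocalFirstCounter(push, limit=10)
accepted_a = sum(a.is_allowed("user-1") for _ in range(8))
accepted_b = sum(b.is_allowed("user-1") for _ in range(8))
a.flush()
b.flush()
print(store["user-1"])         # 16: both processes accepted 8 before syncing
print(b.is_allowed("user-1"))  # False: b has now seen the global total
```

The overshoot (16 against a limit of 10) is the accuracy you trade away; a shorter sync interval shrinks it at the cost of more Redis traffic.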
Designing rate limit policies
A few patterns worth stealing:
- Tier-based: Free at 100/min, Pro at 1k/min, Enterprise at 10k/min. Cleanly tied to billing.
- Endpoint-weighted: /heavy/operation costs 10 tokens; /cheap/lookup costs 1. Still one bucket per user, but operations cost differently.
- Burst + sustained: token bucket with capacity 100, refill 10/sec. Allows bursts up to 100 but sustains 10/sec.
- Quotas separate from rate limits: daily/monthly quotas (10k req/day) on top of per-second/minute rate limits.
Document the policy in your API docs. Surprised users are angry users.
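Keeping the policy as plain data makes it easy to document and to test. The sketch below encodes the tier and endpoint-weight examples from the list; the `TierPolicy` class and helper names are hypothetical, and the daily quotas for Pro and Enterprise are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    per_minute: int  # rate limit
    per_day: int     # quota, tracked separately from the rate limit


TIERS = {
    "free": TierPolicy(per_minute=100, per_day=10_000),
    "pro": TierPolicy(per_minute=1_000, per_day=500_000),                 # quota is a placeholder
    "enterprise": TierPolicy(per_minute=10_000, per_day=10_000_000),      # quota is a placeholder
}

# Tokens charged per call; anything not listed costs 1.
ENDPOINT_COST = {"/heavy/operation": 10, "/cheap/lookup": 1}


def request_cost(endpoint: str) -> int:
    return ENDPOINT_COST.get(endpoint, 1)


def calls_per_minute(tier: str, endpoint: str) -> int:
    """How many calls to one endpoint a tier's per-minute budget covers."""
    return TIERS[tier].per_minute // request_cost(endpoint)


print(calls_per_minute("free", "/heavy/operation"))  # 10
print(calls_per_minute("free", "/cheap/lookup"))     # 100
```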
Testing
Rate limit code is one of the easiest places to ship bugs because you don’t see them until you have load. Test with:
- Unit tests with a mocked clock: drive time.time() forward to verify edge cases.
- Load tests with k6, locust, or wrk: confirm the actual behavior under burst.
- Chaos tests: kill the Redis connection during a burst; the app should fail gracefully (allow or deny; pick a side and document it).
Conclusion
Rate limiting is risk management you do once and forget. Pick sliding-window counter for the common case, token bucket where bursts matter, return proper headers, and enforce at multiple layers. Then test it before you need it — because by the time you do, your users have already noticed.
If you want to go deeper on the Redis side, see Redis Caching Strategies for Backend Developers.
Happy throttling!