Rate Limiting Strategies for APIs

Rate limiting is one of those features that goes unappreciated until you don’t have it, and then your service goes down because someone wrote a `for url in urls: requests.get(url)` loop with a million URLs.

This post covers the four classic rate limiting algorithms (plus a short note on the leaky bucket), when to use each, how to implement them in Redis, and what HTTP headers your API should send back so clients know what’s happening.

Why rate limit at all?

Three different reasons, with different implications:

Protect the service from overload. A bot sending 10,000 requests/sec can exhaust your database’s connections, CPU, or memory. Rate limit to keep the system upright.
Fair usage across customers. One greedy customer shouldn’t degrade everyone else.
Pricing tiers. Free tier = 100 req/min, Pro = 10,000 req/min. Rate limiting is how you enforce the plan.

These overlap, but they’re not the same. Tier enforcement happens per API key. Overload protection might happen per IP or per endpoint regardless of who you are.

The four algorithms

1. Fixed window

Count requests within fixed time buckets:

Limit: 100 requests per minute.

12:00:00–12:00:59 → one window, one counter
12:00:30          → user has made 100 requests
12:00:31          → 101st request: blocked
12:01:00          → counter resets

Pros: simplest to implement; one counter per window. Cons: burst at the boundary. If you allow 100/min, a user can do 100 at 12:00:59 and another 100 at 12:01:00 = 200 in 2 seconds.

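import time

# Assumes `redis` is a connected redis.Redis client (redis-py).
def is_allowed(key: str, limit: int = 100, window: int = 60) -> bool:
    # One counter per key per window; the window index is part of the key name.
    bucket = f"rl:{key}:{int(time.time()) // window}"
    count = redis.incr(bucket)
    if count == 1:
        # First hit in this window: set a TTL so old buckets clean themselves up.
        # INCR and EXPIRE are separate commands; a crash between them leaves a key with no TTL.
        redis.expire(bucket, window)
    return count <= limit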

One INCR per request, plus one EXPIRE on the first request of each window. Cheap. Fine for most cases.
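To see the boundary burst concretely, here is a minimal in-memory sketch of the same counter. The class name `FixedWindowLimiter` and the explicit `now` argument are assumptions made for illustration (so the clock can be controlled); this is not the Redis implementation above.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed-window counter; `now` is passed in so time is controllable."""

    def __init__(self, limit: int = 100, window: int = 60) -> None:
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def is_allowed(self, key: str, now: float) -> bool:
        bucket = (key, int(now) // self.window)
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit


limiter = FixedWindowLimiter(limit=100, window=60)
# 100 requests in the last second of one window...
late = sum(limiter.is_allowed("user-1", now=59.0) for _ in range(100))
# ...and 100 more in the first second of the next window.
early = sum(limiter.is_allowed("user-1", now=60.0) for _ in range(100))
print(late + early)  # 200 requests accepted within about one second
```

Both batches land in different buckets, so neither counter ever exceeds 100 even though the client sent twice the limit back to back.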

2. Sliding window log

Store the timestamp of every request; count how many fall within the last N seconds.

Pros: exact, no burst at boundaries. Cons: memory grows with traffic — every request adds an entry.

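import time
import uuid

# Assumes `redis` is a connected redis.Redis client (redis-py).
def is_allowed(key: str, limit: int = 100, window: int = 60) -> bool:
    now = time.time()
    bucket = f"rl:log:{key}"
    pipe = redis.pipeline()                         # redis-py wraps this in MULTI/EXEC by default
    pipe.zremrangebyscore(bucket, 0, now - window)  # drop entries older than the window
    pipe.zadd(bucket, {str(uuid.uuid4()): now})     # record this request (unique member, timestamp as score)
    pipe.zcard(bucket)                              # count entries in the window, including this one
    pipe.expire(bucket, window)                     # idle keys disappear on their own
    _, _, count, _ = pipe.execute()
    # The entry is added before the check, so rejected requests also count against the limit.
    return count <= limit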

Memory: O(traffic × window). For very high throughput, prefer the next pattern.
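The same idea without Redis, as a rough sketch: one deque of timestamps per key. `SlidingLogLimiter` and the explicit `now` argument are illustrative assumptions. Unlike the Redis version above, this variant does not record rejected requests.

```python
from collections import deque


class SlidingLogLimiter:
    """In-memory sliding window log: one timestamp stored per accepted request."""

    def __init__(self, limit: int = 100, window: int = 60) -> None:
        self.limit = limit
        self.window = window
        self.log = {}  # key -> deque of request timestamps, oldest first

    def is_allowed(self, key: str, now: float) -> bool:
        entries = self.log.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window.
        while entries and entries[0] <= now - self.window:
            entries.popleft()
        if len(entries) >= self.limit:
            return False  # rejected requests are not recorded in this variant
        entries.append(now)
        return True


limiter = SlidingLogLimiter(limit=100, window=60)
late = sum(limiter.is_allowed("user-1", now=59.0) for _ in range(100))
early = sum(limiter.is_allowed("user-1", now=60.0) for _ in range(100))
print(late, early)  # 100 0: the boundary burst from the fixed window is gone
```

The deque holds up to `limit` timestamps per active key, which is the memory cost the text warns about.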

3. Sliding window counter (the practical winner)

Approximate the sliding window using two fixed-window counters:

Count requests in the current window.
Count requests in the previous window, weighted by how much of it overlaps the sliding window.

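import time

# Assumes `redis` is a connected redis.Redis client (redis-py).
def is_allowed(key: str, limit: int = 100, window: int = 60) -> bool:
    now = time.time()
    current_window = int(now) // window
    previous_window = current_window - 1
    elapsed_in_current = (now % window) / window  # fraction of the current window elapsed, 0.0 to 1.0

    # Read both counters in one round trip
    pipe = redis.pipeline()
    pipe.get(f"rl:{key}:{current_window}")
    pipe.get(f"rl:{key}:{previous_window}")
    cur_raw, prev_raw = pipe.execute()
    cur = int(cur_raw or 0)
    prev = int(prev_raw or 0)

    # Weight the previous window by the share of it still inside the sliding window
    estimated = prev * (1 - elapsed_in_current) + cur
    if estimated >= limit:
        return False

    # The check and the increment are separate round trips, so concurrent requests can slightly overshoot
    pipe = redis.pipeline()
    pipe.incr(f"rl:{key}:{current_window}")
    pipe.expire(f"rl:{key}:{current_window}", window * 2)  # keep it around to serve as "previous" next window
    pipe.execute()
    return True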

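Pros: O(1) memory; smooths out the boundary burst. Cons: approximate. The estimate assumes the previous window’s requests were spread evenly, so it is typically off by a few percent and by more when traffic is very bursty. Almost always good enough.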

Cloudflare has described using this approach for its rate limiting, and it is a common choice for large APIs.
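A quick worked example of the weighting, as a sketch. The helper name `estimate` is made up for illustration; it is the same formula as the `estimated` line above, pulled out so the arithmetic is easy to check.

```python
def estimate(prev: int, cur: int, elapsed_fraction: float) -> float:
    """Estimated requests in the sliding window, given the two fixed-window counters."""
    return prev * (1 - elapsed_fraction) + cur


# 15 seconds into a 60-second window: 25% has elapsed,
# so 75% of the previous window still overlaps the sliding window.
print(estimate(prev=80, cur=30, elapsed_fraction=0.25))  # 90.0 -> under a limit of 100

# Right at the boundary the previous window counts in full,
# which is what closes the fixed-window loophole.
print(estimate(prev=100, cur=0, elapsed_fraction=0.0))   # 100.0 -> rejected
```

If all 80 of those earlier requests actually arrived in the final seconds of the previous window, the true count would be 110 rather than 90. That gap is the approximation you are accepting.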

4. Token bucket

A bucket holds N tokens. Each request consumes 1 token. Tokens refill at a steady rate (R tokens/sec). Rejected when empty.

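import time

# Assumes `redis` is a redis.Redis client created with decode_responses=True,
# so hash fields come back as str rather than bytes.
def is_allowed(key: str, capacity: int = 100, refill_rate: float = 1.6) -> bool:
    """capacity = max burst; refill_rate = sustained req/sec (1.6/sec is about 96/min)."""
    now = time.time()
    bucket = f"rl:tb:{key}"
    state = redis.hgetall(bucket)
    tokens = float(state.get("tokens", capacity))  # a new key starts with a full bucket
    last = float(state.get("last", now))

    # Refill based on elapsed time, capped at capacity
    tokens = min(capacity, tokens + (now - last) * refill_rate)
    allowed = tokens >= 1
    if allowed:
        tokens -= 1

    # This read-modify-write is not atomic: concurrent requests can read the same state.
    # In production, move it into a Lua script or a WATCH/MULTI transaction.
    redis.hset(bucket, mapping={"tokens": tokens, "last": now})
    redis.expire(bucket, 3600)  # set on both paths so idle buckets always expire
    return allowed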

Pros: allows controlled bursts (full bucket = burst of capacity); fine-grained control over sustained rate. Cons: more state to track per key.

AWS API Gateway documents its throttling as a token bucket, and Stripe has described using token buckets for its API rate limiters. It’s also the algorithm of choice when you want to allow legitimate bursts.

5. Leaky bucket

Conceptual cousin of the token bucket — a fixed-rate “drain” of a queue. Useful for smoothing bursty input rather than allowing bursts. Less common in API rate limiting; more common in network shaping.
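For comparison, a rough sketch of the leaky bucket used as a meter rather than a real queue: each request raises the level by one, the level drains at a fixed rate, and requests that would overflow are rejected. The class name, the parameters, and the explicit `now` argument are all assumptions for illustration.

```python
from typing import Optional


class LeakyBucket:
    """Leaky bucket as a meter: level rises by 1 per request and drains at leak_rate per second."""

    def __init__(self, capacity: float, leak_rate: float) -> None:
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0
        self.last: Optional[float] = None

    def is_allowed(self, now: float) -> bool:
        if self.last is not None:
            # Drain whatever leaked out since the previous request.
            self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False  # the bucket would overflow
        self.level += 1
        return True


bucket = LeakyBucket(capacity=5, leak_rate=1.0)
burst = sum(bucket.is_allowed(now=0.0) for _ in range(7))
later = sum(bucket.is_allowed(now=2.0) for _ in range(3))
print(burst, later)  # 5 2
```

A queue-based variant would hold the excess requests and release them at the drain rate instead of rejecting them, which is the traffic-shaping use the text mentions.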

Choosing the right algorithm

Algorithm	Memory per key	Accuracy	Burst behavior	Notes
Fixed window	O(1)	Loose at boundaries	Up to 2× the limit at a boundary	Simplest
Sliding window log	O(requests in window)	Exact	None	Easy but memory-heavy
Sliding window counter	O(1)	Approximate (a few percent off)	Smoothed	Best default
Token bucket	O(1)	Exact	Allowed up to bucket size	Most flexible

For most APIs, sliding window counter is the right default. Use token bucket when you want to allow controlled bursts (e.g. SDK clients that batch).

What to limit on

Per API key: the most common; how you enforce pricing tiers.
Per user / account: when one account holds several API keys (one per app, say), so the limit follows the account rather than each key.
Per IP: for unauthenticated endpoints (login attempts, public APIs).
Per endpoint: different limits for different operations. /auth/login should be much stricter than /articles.
Multiple dimensions at once: enforce all of them; the strictest wins.

For login endpoints specifically, also rate limit on the target account: per username, and per (username, IP) pair, to blunt credential stuffing and password guessing even when the attacker rotates IPs.
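Enforcing several dimensions at once can be as simple as checking every key and allowing the request only if all of them pass. The sketch below uses a plain in-memory counter as a stand-in for a real limiter; the limits, key formats, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical limits, kept small so the behaviour is easy to follow.
KEY_LIMIT = 5
IP_LIMIT = 4
ENDPOINT_LIMITS = {"/auth/login": 2}
DEFAULT_ENDPOINT_LIMIT = 100

counts = Counter()  # stand-in for whichever limiter you actually use


def count_and_check(key: str, limit: int) -> bool:
    counts[key] += 1
    return counts[key] <= limit


def is_allowed(api_key: str, ip: str, endpoint: str) -> bool:
    dimensions = [
        (f"key:{api_key}", KEY_LIMIT),
        (f"ip:{ip}", IP_LIMIT),
        (f"endpoint:{endpoint}:{api_key}", ENDPOINT_LIMITS.get(endpoint, DEFAULT_ENDPOINT_LIMIT)),
    ]
    # Evaluate every dimension (no short-circuit) so each counter sees the request.
    results = [count_and_check(key, limit) for key, limit in dimensions]
    return all(results)


requests_made = ["/auth/login"] * 3 + ["/articles"] * 2
results = [is_allowed("key-1", "203.0.113.7", path) for path in requests_made]
print(results)  # [True, True, False, True, False]
```

The third login trips the endpoint limit, and the last request trips the per-IP limit even though the key and endpoint still have room. Whether a rejected request should still count against the other dimensions is a policy choice; this sketch counts it.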

HTTP headers: tell clients what’s going on

Pick one of the de facto standards:

Draft RFC headers (`RateLimit-*`)

RateLimit-Limit: 100
RateLimit-Remaining: 23
RateLimit-Reset: 47           # seconds until reset

GitHub-style headers (`X-RateLimit-*`)

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 23
X-RateLimit-Reset: 1714378200 # Unix timestamp

When you actually deny a request:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 30

{
  "error": "rate_limited",
  "message": "Too many requests. Retry in 30 seconds."
}

Retry-After (in seconds, or an HTTP date) is widely understood by clients and SDKs. Always include it.
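On the client side, handling both forms of Retry-After takes only a few lines. This is a minimal sketch using the standard library; the function name `retry_after_seconds` and the explicit `now` argument are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value: str, now: datetime) -> float:
    """Convert a Retry-After header value (delta-seconds or HTTP-date) into seconds to wait."""
    value = value.strip()
    if value.isdigit():
        return float(value)
    # Otherwise treat it as an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    when = parsedate_to_datetime(value)
    return max(0.0, (when - now).total_seconds())


now = datetime(2015, 10, 21, 7, 27, 30, tzinfo=timezone.utc)
print(retry_after_seconds("30", now))                             # 30.0
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now))  # 30.0
```

A real client would also want to handle a malformed value, for example by falling back to its own backoff schedule.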

Pick one set of headers for your API and stick with it. Don’t mix.
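A small helper keeps the headers consistent across every response. This sketch emits the `RateLimit-*` set shown above; the function name and its parameters are illustrative assumptions, not part of any framework.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_in: float, allowed: bool) -> dict:
    """Build rate limit response headers; reset_in is the number of seconds until the window resets."""
    seconds = max(0, math.ceil(reset_in))  # round up so clients never retry too early
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(seconds),
    }
    if not allowed:
        # Only sent alongside a 429 response.
        headers["Retry-After"] = str(seconds)
    return headers


print(rate_limit_headers(limit=100, remaining=23, reset_in=46.3, allowed=True))
print(rate_limit_headers(limit=100, remaining=0, reset_in=29.2, allowed=False))
```

Attach the result to every response, not just the 429s, so well-behaved clients can slow down before they hit the limit.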

Where to enforce

Three layers, each useful for different things:

1. The CDN / edge

Cloudflare, Fastly, AWS WAF — they can rate limit before traffic reaches you. Best for blocking obviously abusive traffic and DDoS-style attacks. Cheap and absorbs the heaviest load.

2. The API gateway / reverse proxy

Nginx, Envoy, Kong, Traefik — rate limit at the proxy. Works without hitting your application code. Good for per-IP, per-endpoint, and global limits.

Nginx example:

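# Track clients by IP: 10 MB of shared state, 10 requests/sec each
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        # Let bursts of up to 20 extra requests through without delaying them
        limit_req zone=api burst=20 nodelay;
        # nginx answers limited requests with 503 unless told otherwise
        limit_req_status 429;
        proxy_pass http://app;
    }
}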

3. The application

For per-user, per-API-key, per-tenant limits — anything that requires authentication context. Implement with Redis as shown above, or use a library:

Python: slowapi, Flask-Limiter, django-ratelimit.
Go: uber-go/ratelimit for in-process limiting, or go-redis/redis_rate for Redis-backed limits.

For most APIs, do all three: edge for DDoS, proxy for per-IP, app for per-user/key.

Distributed systems concerns

If you have multiple app servers, your rate limiting state must be shared, which is where Redis (or another central store) comes in. Don’t use in-process counters across a fleet: users can get up to N× their actual limit, where N is the number of app servers.

For very high throughput where Redis itself becomes a bottleneck:

Approximate counters (e.g. count-min sketch): exchange a tiny accuracy hit for huge memory savings. HyperLogLog estimates distinct counts, so it suits limits like “unique IPs per key” rather than raw request counts.
Local-first with periodic sync: count locally per-process, sync deltas to Redis every few seconds. Approximate, but cheap.
Distributed rate limiting services: Envoy’s rate limit service, YouTube’s Doorman, etc.
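To make the local-first idea concrete, here is a rough sketch. Each process counts hits locally and periodically pushes its deltas to a shared store, which answers with the current global totals. `LocalFirstLimiter`, `flush_to_shared`, and the in-memory `shared` counter (standing in for Redis) are all hypothetical, and window handling is left out to keep it short.

```python
from collections import Counter


class LocalFirstLimiter:
    def __init__(self, flush, interval: float = 2.0) -> None:
        self.flush = flush          # callable: {key: delta} -> {key: global total}
        self.interval = interval
        self.pending = Counter()    # hits admitted locally since the last sync
        self.known_global = {}      # last global totals seen from the shared store
        self.last_sync = 0.0

    def is_allowed(self, key: str, limit: int, now: float) -> bool:
        if now - self.last_sync >= self.interval:
            self.sync(now)
        if self.known_global.get(key, 0) + self.pending[key] + 1 > limit:
            return False
        self.pending[key] += 1
        return True

    def sync(self, now: float) -> None:
        self.known_global.update(self.flush(dict(self.pending)))
        self.pending.clear()
        self.last_sync = now


shared = Counter()  # stand-in for Redis


def flush_to_shared(deltas: dict) -> dict:
    for key, delta in deltas.items():
        shared[key] += delta
    return {key: shared[key] for key in deltas}


a = LocalFirstLimiter(flush_to_shared)
b = LocalFirstLimiter(flush_to_shared)
first_a = sum(a.is_allowed("user-1", 10, now=0.0) for _ in range(6))
first_b = sum(b.is_allowed("user-1", 10, now=0.0) for _ in range(6))
print(first_a + first_b)                     # 12: both admit 6 before either has synced
print(a.is_allowed("user-1", 10, now=2.0))   # True: a syncs first and sees a total of 6
print(b.is_allowed("user-1", 10, now=2.0))   # False: b syncs and sees 12 already used
```

The overshoot (12 admitted against a limit of 10) is bounded by how much traffic can arrive within one sync interval, which is the accuracy you trade for fewer round trips.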

Designing rate limit policies

A few patterns worth stealing:

Tier-based — Free: 100/min, Pro: 1k/min, Enterprise: 10k/min. Cleanly tied to billing.
Endpoint-weighted — /heavy/operation costs 10 tokens; /cheap/lookup costs 1. Still one bucket per user, but operations cost differently.
Burst + sustained — token bucket with capacity 100, refill 10/sec. Allows bursts up to 100 but sustains 10/sec.
Quotas separate from rate limits — daily/monthly quotas (10k req/day) on top of per-second/minute rate limits.
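The endpoint-weighted and burst + sustained patterns combine naturally in one token bucket. A minimal in-memory sketch follows; the `TokenBucket` class, the `COSTS` table, and the explicit `now` argument are assumptions for illustration.

```python
class TokenBucket:
    """Token bucket where each operation can cost more than one token."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last = now

    def take(self, cost: float, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


COSTS = {"/heavy/operation": 10, "/cheap/lookup": 1}

bucket = TokenBucket(capacity=100, refill_rate=10)
heavy = sum(bucket.take(COSTS["/heavy/operation"], now=0.0) for _ in range(12))
cheap_now = bucket.take(COSTS["/cheap/lookup"], now=0.0)
cheap_later = bucket.take(COSTS["/cheap/lookup"], now=0.5)
print(heavy, cheap_now, cheap_later)  # 10 False True
```

Ten heavy calls drain the full burst of 100 tokens, and half a second later the refill of 10/sec has restored enough for cheap calls again.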

Document the policy in your API docs. Surprised users are angry users.

Testing

Rate limit code is one of the easiest places to ship bugs because you don’t see them until you have load. Test with:

Unit tests with a mocked clock — drive time.time() forward to verify edge cases.
Load tests with k6, locust, or wrk — confirm the actual behavior under burst.
Chaos tests — kill the Redis connection during a burst; the app should fail gracefully (allow or deny — pick a side, document it).
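A mocked-clock test can look like the sketch below, which patches `time.time` with the standard library’s `unittest.mock`. The `FixedWindow` class is a small in-memory stand-in written for this example, not the Redis code from earlier.

```python
import time
from unittest import mock


class FixedWindow:
    """Tiny in-memory fixed-window limiter that reads the clock via time.time()."""

    def __init__(self, limit: int, window: int) -> None:
        self.limit = limit
        self.window = window
        self.counts = {}

    def is_allowed(self, key: str) -> bool:
        bucket = (key, int(time.time()) // self.window)
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.limit


def test_window_rollover() -> None:
    limiter = FixedWindow(limit=2, window=60)
    with mock.patch("time.time", return_value=30.0) as clock:
        assert limiter.is_allowed("k")
        assert limiter.is_allowed("k")
        assert not limiter.is_allowed("k")   # third request in the same window
        clock.return_value = 60.0            # jump to the start of the next window
        assert limiter.is_allowed("k")       # the counter has reset


test_window_rollover()
```

The same pattern covers the awkward cases that are hard to hit with real time: the exact boundary second, a long idle gap, or a clock that jumps backwards.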

Conclusion

Rate limiting is risk management you do once and forget. Pick sliding-window counter for the common case, token bucket where bursts matter, return proper headers, and enforce at multiple layers. Then test it before you need it — because by the time you do, your users have already noticed.

If you want to go deeper on the Redis side, see Redis Caching Strategies for Backend Developers.

Happy throttling!

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
