Token bucket or sliding window?

Token bucket for ergonomics (allows bursts; smoothed average). Sliding window when you need strict 'at most N per minute' semantics. Most public APIs use token bucket.
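For the strict "at most N per minute" case, a sliding-window log is the simplest thing that works. A minimal in-memory sketch; the class name `SlidingWindowLimiter` and its parameters are made up for illustration, and a production version would keep the log in shared storage:

```python
from collections import deque


class SlidingWindowLimiter:
    """Allows at most `limit` requests in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        cutoff = now - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False  # rejected requests are not recorded
        self.events.append(now)
        return True
```

Memory grows with the limit (one timestamp per accepted request), which is part of why token bucket is the more common default.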

Should rate limits be per-key or per-user?

Both. Per-key for billing tier enforcement. Per-user (or per-IP for anonymous) for abuse prevention. They compose.

Designing API Rate Limits Customers Don't Hate — Tiers, Headers, and Fairness

Rate limits are a customer-facing API. Done well, customers integrate happily and rarely hit them. Done poorly, you generate support tickets and lose trust. This post is the working playbook.

Tier matrix

Free:           60 req/min,    1k req/day,   no concurrency
Pro:          1000 req/min,  100k req/day,  10 concurrent
Enterprise: Custom (negotiated per contract)

Document this. Customers want to know before they integrate, not after they hit a limit.

Per-endpoint limits

Some endpoints cost more:

/v1/users/{id}        cheap     1 unit
/v1/search            medium    5 units
/v1/reports/generate  heavy    50 units

Use a token bucket with weighted costs, so heavy endpoints consume more budget. Customers can’t hammer expensive endpoints and degrade the baseline for everyone else.
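A minimal in-memory sketch of a weighted token bucket. The `TokenBucket` class and the `ENDPOINT_COST` table are illustrative names, not from any library, and the capacity and refill numbers are assumptions:

```python
import time
from typing import Optional

# Hypothetical cost table mirroring the endpoint list above.
ENDPOINT_COST = {
    "/v1/users/{id}": 1,
    "/v1/search": 5,
    "/v1/reports/generate": 50,
}


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float, now: Optional[float] = None):
        self.capacity = capacity              # burst size, in units
        self.refill_per_sec = refill_per_sec  # sustained rate, in units per second
        self.tokens = float(capacity)         # start full
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, cost: float = 1, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True
```

With a capacity of 60, one report generation (50 units) leaves room for only two searches until the bucket refills, which is the point: cheap reads stay cheap, heavy calls pay their way.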

The headers

Standard:

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1714650000
RateLimit: limit=1000, remaining=847, reset=60

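The RateLimit header comes from an IETF draft, so it is the emerging standard rather than a finished RFC. X-RateLimit-* is the legacy convention; many clients still expect it. Note the different reset semantics above: X-RateLimit-Reset is a Unix timestamp, while reset in RateLimit is seconds from now.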
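One way to keep the two header families consistent is to build them from the same numbers in a single place. A small sketch; `rate_limit_headers` is a hypothetical helper, and the `RateLimit` value simply follows the format shown above:

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int, now_epoch: float) -> dict:
    """Build legacy and draft-style headers from one set of values."""
    remaining = max(0, remaining)
    # Legacy header carries an absolute timestamp; the draft form carries a delta.
    reset_in = max(0, math.ceil(reset_epoch - now_epoch))
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_epoch),
        "RateLimit": f"limit={limit}, remaining={remaining}, reset={reset_in}",
    }
```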

429 responses

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "60 requests per minute exceeded for free tier",
  "retry_after_seconds": 30,
  "documentation_url": "https://docs.example.com/rate-limits"
}

Retry-After in seconds. Machine-readable. Doc link for humans. Specific message about which limit.
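A sketch of producing that response from limiter state. Both function names are hypothetical, and the Retry-After calculation assumes a token bucket where you know the current tokens and the refill rate:

```python
import json
import math


def seconds_until_available(tokens: float, cost: float, refill_per_sec: float) -> int:
    """Whole seconds until the bucket holds enough tokens for `cost`."""
    if tokens >= cost:
        return 0
    return math.ceil((cost - tokens) / refill_per_sec)


def too_many_requests(limit, window, tier, retry_after,
                      docs_url="https://docs.example.com/rate-limits"):
    """Return (status, headers, body) for a 429 in the shape shown above."""
    body = {
        "error": "rate_limit_exceeded",
        "message": f"{limit} requests per {window} exceeded for {tier} tier",
        "retry_after_seconds": retry_after,
        "documentation_url": docs_url,
    }
    headers = {"Retry-After": str(retry_after), "Content-Type": "application/json"}
    return 429, headers, json.dumps(body)
```

Keeping the header and the body field fed by the same value avoids the case where they disagree.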

Per-user vs per-IP

async def check(req):
    if user := req.user:
        if not await allow(f"user:{user.id}", user.tier_limits):
            raise TooManyRequests
    else:
        if not await allow(f"ip:{req.ip}", anon_limits):
            raise TooManyRequests

Authenticated → per-user (per-key in B2B). Anonymous → per-IP. They compose.

Cost endpoints

For variable-cost ops (LLM calls, large queries):

async def check_cost(user_id, estimated_cost):
    if not await allow(f"user:{user_id}", cost=estimated_cost):
        raise TooManyRequests

Estimate before; charge actual after. Adjust budget on actual.
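The estimate-then-settle flow can be sketched as a reserve/settle pair. `CostBudget` is an illustrative in-memory class, not a real library; it ignores window resets to keep the idea visible:

```python
class CostBudget:
    """Tracks spend against a budget using estimate-then-settle accounting."""

    def __init__(self, limit: float):
        self.limit = limit
        self.used = 0.0

    def reserve(self, estimated: float) -> bool:
        # Admission decision is made on the estimate, before the work runs.
        if self.used + estimated > self.limit:
            return False
        self.used += estimated
        return True

    def settle(self, estimated: float, actual: float) -> None:
        # Replace the estimate with the real cost once it is known.
        # If actual > estimated, `used` may exceed the limit; later calls are rejected.
        self.used = max(0.0, self.used + (actual - estimated))
```

Underestimates are paid back on the next request rather than by failing work that has already finished.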

Concurrency limits

Distinct from rate (per-time):

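async def check_concurrency(user_id, tier_limit):
    sem_key = f"concur:{user_id}"
    current = await redis.incr(sem_key)
    # Safety net: if a worker dies before releasing, the counter eventually expires.
    # Keep the TTL longer than your longest request, or a late decr will
    # recreate the key with a negative count.
    await redis.expire(sem_key, 60)
    if current > tier_limit:
        await redis.decr(sem_key)
        raise TooManyConcurrent
    return sem_key

# Release in a finally block so failed requests don't leak slots:
sem_key = await check_concurrency(user_id, tier_limit)
try:
    ...  # handle the request
finally:
    await redis.decr(sem_key)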

Prevents a single user from holding 1000 concurrent connections.

Fairness

For multi-tenant: one heavy user shouldn’t starve others. Layer:

Global cap on the system.
Per-tier caps on aggregate.
Per-user caps within tier.
Per-endpoint caps for expensive ops.

Tighter limits inside; looser outside. Multi-layer fairness.
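A sketch of the layering, using plain per-window counters for brevity. `check_layers` and the key names are assumptions for illustration. The detail that matters is checking every layer before charging any of them, so a rejection at one layer doesn't burn budget at another:

```python
from typing import Dict, List, Optional


def check_layers(usage: Dict[str, int], limits: Dict[str, int],
                 keys: List[str], cost: int = 1) -> Optional[str]:
    """Return None if the request is allowed, else the key of the layer that blocked it."""
    # Pass 1: verify every layer has room. Nothing is charged yet.
    for key in keys:
        if usage.get(key, 0) + cost > limits[key]:
            return key
    # Pass 2: all layers agreed, so charge all of them.
    for key in keys:
        usage[key] = usage.get(key, 0) + cost
    return None
```

A request from user `a` on the free tier would pass `["global", "tier:free", "user:a"]`. In a shared store the two passes need to run atomically (for example inside one Lua script), otherwise concurrent requests can both pass the check.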

Stripe-style patterns

Per-secret-key: each integration has its own budget.
Idempotency keys combined with rate limits: same key in window doesn’t count again.
Stripe-Account header: per-Connected-account limits for platform integrations.

Maps well to platform / partner APIs.

GitHub-style patterns

Per-token + IP combined.
Cost per query (GraphQL): node count weighting.
Secondary rate limits for abuse patterns (rapid creation, search).

For GraphQL APIs: weight by query depth + breadth. Cheap queries use little; expensive queries use more.
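A rough sketch of node-count weighting. The query is represented as a nested dict rather than a parsed GraphQL document, and the rules here (page size from `first`, scalar leaves free) are simplifying assumptions, not GitHub's exact formula:

```python
def query_cost(selection: dict, parents: int = 1) -> int:
    """Estimate how many object nodes a query can return.

    Each field maps to a spec: optional "first" (page size) and
    optional "fields" (nested selection). Scalar leaves cost nothing.
    """
    total = 0
    for spec in selection.values():
        children = spec.get("fields")
        if not children:
            continue  # scalar leaf
        # Breadth: page size. Depth: multiplied by the number of parent nodes.
        nodes = parents * spec.get("first", 1)
        total += nodes + query_cost(children, nodes)
    return total


cheap = {"viewer": {"fields": {"login": {}}}}
expensive = {
    "repositories": {
        "first": 10,
        "fields": {"issues": {"first": 20, "fields": {"title": {}}}},
    }
}
```

Here `cheap` costs 1 and `expensive` costs 210 (10 repositories plus 10 × 20 issues), so the same budget that allows hundreds of simple lookups allows only a few deep, wide queries.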

SDK behavior

A good SDK respects 429 + Retry-After:

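const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function call(req) {
    for (let i = 0; i < 5; i++) {
        const resp = await fetch(req);
        if (resp.status === 429) {
            // Retry-After can also be an HTTP date; fall back to 10s if it isn't a number.
            const parsed = parseInt(resp.headers.get("retry-after"), 10);
            const wait = Number.isNaN(parsed) ? 10 : parsed;
            // Up to 0.5s of jitter so clients don't retry in lockstep.
            await sleep((wait + Math.random() * 0.5) * 1000);
            continue;
        }
        return resp;
    }
    throw new Error("rate limit exhausted");
}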

Document this behavior; offer SDKs that handle it.

Soft limits before hard

At 80% budget consumed: response includes warning header.
At 100%: 429.

Gives customers a chance to back off before being blocked.

RateLimit-Warning: approaching limit; 200 of 1000 remaining
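The soft-limit check is a few lines. `RateLimit-Warning` is a custom header name, not part of any standard, and `warning_header` is a hypothetical helper; the 80% threshold matches the rule above:

```python
def warning_header(limit: int, remaining: int, warn_at_percent: int = 80) -> dict:
    """Return a warning header once usage crosses the soft threshold, else nothing."""
    used = limit - remaining
    # Integer comparison avoids float rounding right at the threshold.
    # At remaining == 0 the request gets a 429 instead, so no warning is needed.
    if remaining > 0 and used * 100 >= limit * warn_at_percent:
        return {"RateLimit-Warning": f"approaching limit; {remaining} of {limit} remaining"}
    return {}
```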

Customer dashboards

Show:

Current minute / hour / day usage.
Per-endpoint breakdown.
Recent 429s.
Quota and tier.

Customers integrate better when they can see usage. Reduces support load too.

Rate limit increases

Process for “I need more”:

Self-service for free → pro upgrade.
Form for enterprise with usage justification.
Engineering review for bulk requests; protect against abuse.

Make it easy for legit customers; gate the rest.

Common mistakes

1. Drop without 429

Connection reset; client retries hard. Always 429 + Retry-After.

2. Inconsistent limit communication

Headers say one thing; docs another; reality a third. Single source of truth.

3. No per-endpoint differentiation

A search query and a small read cost the same against the budget. Weight heavy ops so they consume more.

4. No customer visibility

Customers find out the hard way. Dashboards + headers.

5. Hard cliff with no backoff signal

Suddenly 429 at minute boundary; spike of retries. Smooth via token bucket; signal approaching limit early.

Implementation

Edge / API gateway: anon and IP limits.
App layer: per-user / per-key / per-endpoint.
Redis-backed token bucket with Lua for atomicity.
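A sketch of the Redis piece, assuming a redis-py asyncio client (its `eval(script, numkeys, *keys_and_args)` method) is passed in. The function name, hash field names, and TTL choice are assumptions; treat it as a starting point, not a drop-in limiter:

```python
import time

# Runs atomically inside Redis: refill, check, and charge in one step.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill = tonumber(ARGV[2])   -- tokens per second
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])      -- seconds, supplied by the caller

local state = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(state[1])
local ts = tonumber(state[2])
if tokens == nil or ts == nil then
  tokens = capacity                -- first request: start with a full bucket
  ts = now
end

tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
-- Idle buckets expire; a missing key is treated as full anyway.
redis.call('EXPIRE', key, math.ceil(capacity / refill) * 2)
return {allowed, tostring(tokens)}
"""


async def allow_request(client, key, capacity, refill_per_sec, cost=1):
    """Return (allowed, tokens_left). `client` is assumed to be a redis.asyncio.Redis."""
    allowed, tokens = await client.eval(
        TOKEN_BUCKET_LUA, 1, key, capacity, refill_per_sec, cost, time.time()
    )
    return bool(allowed), float(tokens)
```

The timestamp comes from the app server, so clock skew between servers shows up as slightly uneven refill. If that matters, read the time inside Redis instead.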

See Rate Limiter Design.

What I’d ship today

For a public API:

Per-tier docs with concrete numbers.
RateLimit headers on every response.
429 + Retry-After + JSON error body.
Per-endpoint cost weights.
Per-user + per-IP layers.
Concurrency caps for streaming / WebSocket.
Customer usage dashboard.
Self-service upgrade flow.
SDK that respects 429.

Read this next

If you want my customer-facing rate limit playbook + dashboard template, it’s at rajpoot.dev.

Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.
