Rate limits are a customer-facing API. Done well, customers integrate happily and rarely hit them. Done poorly, you generate support tickets and lose trust. This post is the working playbook.

Tier matrix

Tier         Per minute     Per day         Concurrency
Free         60 req/min     1k req/day      no concurrency
Pro          1000 req/min   100k req/day    10 concurrent
Enterprise   Custom (negotiated per contract)

Document this. Customers want to know before they integrate, not after they hit a limit.

Per-endpoint limits

Some endpoints cost more:

/v1/users/{id}        cheap     1 unit
/v1/search            medium    5 units
/v1/reports/generate  heavy    50 units

Use a token bucket with weighted costs: heavy endpoints consume more of the budget, so customers can’t hammer the expensive ones and degrade baseline traffic.
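A minimal in-memory sketch of a weighted token bucket, assuming the unit costs listed above. The names `TokenBucket` and `ENDPOINT_COST` are hypothetical, and a production limiter would keep this state in shared storage rather than in process memory.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical cost table mirroring the weights listed above.
ENDPOINT_COST = {
    "/v1/users/{id}": 1,
    "/v1/search": 5,
    "/v1/reports/generate": 50,
}


@dataclass
class TokenBucket:
    capacity: float          # maximum units the bucket can hold
    refill_per_sec: float    # units added back per second
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self) -> None:
        # Start full so a new customer is not throttled on the first call.
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def allow(self, cost: float, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend `cost` units if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With a capacity of 60 units, one report generation leaves room for only ten cheap reads until the bucket refills, which is the point of weighting.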

The headers

Standard:

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1714650000
RateLimit: limit=1000, remaining=847, reset=60

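The RateLimit header comes from an IETF draft and is the emerging standard form. X-RateLimit-* is legacy but still expected by many clients, so send both. Note the difference in reset: X-RateLimit-Reset above is a Unix timestamp, while reset in the RateLimit header is the number of seconds until the window resets.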
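One way to emit both header families from the same numbers, so they cannot drift apart. This is a sketch: the function name and parameters are assumptions, and the RateLimit value follows the format shown in the example above rather than any particular draft revision.

```python
import math
import time
from typing import Optional


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int,
                       now: Optional[float] = None) -> dict:
    """Build legacy X-RateLimit-* headers and the combined RateLimit header."""
    now = time.time() if now is None else now
    # The combined header carries seconds until reset, not a timestamp.
    reset_delta = max(0, math.ceil(reset_epoch - now))
    remaining = max(0, remaining)
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_epoch),
        "RateLimit": f"limit={limit}, remaining={remaining}, reset={reset_delta}",
    }
```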

429 responses

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "60 requests per minute exceeded for free tier",
  "retry_after_seconds": 30,
  "documentation_url": "https://docs.example.com/rate-limits"
}

Retry-After in seconds. Machine-readable. Doc link for humans. Specific message about which limit.
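A small helper can keep the header and the JSON body in sync. The sketch below is framework-agnostic and returns a plain tuple; the function name and its parameters are hypothetical.

```python
import json


def too_many_requests(limit: int, window: str, tier: str, retry_after: int,
                      docs_url: str = "https://docs.example.com/rate-limits"):
    """Return (status, headers, body) for a 429 response."""
    body = {
        "error": "rate_limit_exceeded",
        # Name the specific limit that was hit so the customer knows what to fix.
        "message": f"{limit} requests per {window} exceeded for {tier} tier",
        "retry_after_seconds": retry_after,
        "documentation_url": docs_url,
    }
    headers = {
        "Retry-After": str(retry_after),
        "Content-Type": "application/json",
    }
    return 429, headers, json.dumps(body)
```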

Per-user vs per-IP

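async def check(req):
    # `allow`, `anon_limits`, and `TooManyRequests` are defined elsewhere.
    if user := req.user:
        # Authenticated: the budget is keyed by user and sized by the user's tier.
        if not await allow(f"user:{user.id}", user.tier_limits):
            raise TooManyRequests
    else:
        # Anonymous: fall back to a budget keyed by client IP.
        if not await allow(f"ip:{req.ip}", anon_limits):
            raise TooManyRequests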

Authenticated → per-user (per-key in B2B). Anonymous → per-IP. They compose: IP limits can sit at the edge while per-user limits run in the app layer.

Cost endpoints

For variable-cost ops (LLM calls, large queries):

async def check_cost(user_id, estimated_cost):
    if not await allow(f"user:{user_id}", cost=estimated_cost):
        raise TooManyRequests

Estimate the cost before the call and reserve it; once the call finishes, reconcile against the actual cost by refunding or charging the difference.
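The reserve-then-settle flow could look like the sketch below. `CostBudget` is a hypothetical in-memory stand-in for a single user's budget; letting the balance go negative when a call costs more than estimated is one possible design choice, not the only one.

```python
class CostBudget:
    """In-memory budget for one user; real systems would keep this in shared storage."""

    def __init__(self, budget: float) -> None:
        self.remaining = budget

    def reserve(self, estimated_cost: float) -> bool:
        """Hold the estimated cost up front; reject if it does not fit."""
        if estimated_cost > self.remaining:
            return False
        self.remaining -= estimated_cost
        return True

    def settle(self, estimated_cost: float, actual_cost: float) -> None:
        """Refund an overestimate or charge the shortfall after the call."""
        # A negative balance means the user overspent and must wait for refill.
        self.remaining += estimated_cost - actual_cost
```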

Concurrency limits

Concurrency limits cap requests in flight at once, which is distinct from rate (requests per unit of time):

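async def check_concurrency(user_id, tier_limit):
    sem_key = f"concur:{user_id}"
    current = await redis.incr(sem_key)
    # Safety net: if a worker dies before releasing, the counter expires.
    # The TTL should exceed the longest expected request, or the count drifts.
    await redis.expire(sem_key, 60)
    if current > tier_limit:
        await redis.decr(sem_key)
        raise TooManyConcurrent
    return sem_key

# Release in a finally block so a failed request does not leak its slot:
sem_key = await check_concurrency(user_id, tier_limit)
try:
    response = await handler(req)  # hypothetical request handler
finally:
    await redis.decr(sem_key)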

Prevents a single user from holding 1000 concurrent connections.

Fairness

For multi-tenant: one heavy user shouldn’t starve others. Layer:

  • Global cap on the system.
  • Per-tier caps on aggregate.
  • Per-user caps within tier.
  • Per-endpoint caps for expensive ops.

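Limits get tighter toward the inner layers and looser toward the outer ones. A request must pass every layer, so no single user or endpoint can exhaust a shared budget.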
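As a rough illustration of the layering above, the sketch below uses simple per-window counters. The class name, the cap values, and the key format are assumptions; each layer could equally be a token bucket.

```python
from collections import defaultdict


class LayeredLimiter:
    """Counts requests per window at four layers; all layers must have room."""

    def __init__(self, caps: dict) -> None:
        # caps maps a layer name ("global", "tier", "user", "endpoint")
        # to the maximum requests allowed per window at that layer.
        self.caps = caps
        self.counts = defaultdict(int)

    def allow(self, tier: str, user_id: str, endpoint: str) -> bool:
        keys = [
            ("global", "global"),
            ("tier", f"tier:{tier}"),
            ("user", f"user:{user_id}"),
            ("endpoint", f"endpoint:{user_id}:{endpoint}"),
        ]
        # Check every layer first so a rejected request consumes nothing.
        for layer, key in keys:
            if self.counts[key] >= self.caps[layer]:
                return False
        for _, key in keys:
            self.counts[key] += 1
        return True

    def reset_window(self) -> None:
        """Call at the start of each window to clear all counters."""
        self.counts.clear()
```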

Stripe-style patterns

  • Per-secret-key: each integration has its own budget.
  • Idempotency keys combined with rate limits: same key in window doesn’t count again.
  • Stripe-Account header: per-Connected-account limits for platform integrations.

Maps well to platform / partner APIs.

GitHub-style patterns

  • Per-token + IP combined.
  • Cost per query (GraphQL): node count weighting.
  • Secondary rate limits for abuse patterns (rapid creation, search).

For GraphQL APIs: weight by query depth + breadth. Cheap queries use little; expensive queries use more.
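Weighting by depth and breadth can be approximated by counting the nodes a query could return. The sketch below assumes a simplified query shape, a dict mapping each field to its page size and child selection; that shape is a hypothetical stand-in for a parsed GraphQL document, not GitHub’s actual formula.

```python
def query_cost(selection: dict, multiplier: int = 1) -> int:
    """Estimate how many nodes a nested selection could return.

    `selection` maps a field name to (page_size, child_selection).
    Each level multiplies by the page sizes of its ancestors, so deep
    and wide queries cost far more than shallow ones.
    """
    total = 0
    for _field, (page_size, children) in selection.items():
        nodes = multiplier * page_size
        total += nodes + query_cost(children, nodes)
    return total


# A single object costs 1 unit; 10 repositories with 20 issues each
# cost 10 + 10 * 20 = 210 units.
cheap = {"viewer": (1, {})}
expensive = {"repositories": (10, {"issues": (20, {})})}
```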

SDK behavior

A good SDK respects 429 + Retry-After:

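const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function call(req) {
    for (let i = 0; i < 5; i++) {
        const resp = await fetch(req);
        if (resp.status === 429) {
            // Retry-After may also be an HTTP date; fall back to 10 s if it is not a number.
            const parsed = parseInt(resp.headers.get("retry-after") ?? "", 10);
            const wait = Number.isNaN(parsed) ? 10 : parsed;
            // Add up to 0.5 s of jitter so clients do not retry in lockstep.
            await sleep((wait + Math.random() * 0.5) * 1000);
            continue;
        }
        return resp;
    }
    throw new Error("rate limit exhausted");
}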

Document this behavior; offer SDKs that handle it.

Soft limits before hard

At 80% of the budget consumed, responses include a warning header.
At 100%, the API returns 429.

Gives customers a chance to back off before being blocked.

RateLimit-Warning: approaching limit; 200 of 1000 remaining
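A sketch of how the warning value might be computed. RateLimit-Warning, as used in the example above, is a custom header rather than a standardized one, and the function name and the 0.8 default threshold are assumptions taken from the 80% figure.

```python
from typing import Optional


def soft_limit_header(limit: int, remaining: int,
                      threshold: float = 0.8) -> Optional[str]:
    """Return a RateLimit-Warning value once usage crosses the threshold."""
    used_fraction = (limit - remaining) / limit
    # At zero remaining the hard limit (429) applies, so no warning is needed.
    if remaining > 0 and used_fraction >= threshold:
        return f"approaching limit; {remaining} of {limit} remaining"
    return None
```

Attach the returned value as the RateLimit-Warning header only when it is not None, so clients can treat the header’s presence as the signal.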

Customer dashboards

Show:

  • Current minute / hour / day usage.
  • Per-endpoint breakdown.
  • Recent 429s.
  • Quota and tier.

Customers integrate better when they can see usage. Reduces support load too.

Rate limit increases

Process for “I need more”:

  • Self-service for free → pro upgrade.
  • Form for enterprise with usage justification.
  • Engineering review for bulk requests; protect against abuse.

Make it easy for legit customers; gate the rest.

Common mistakes

1. Drop without 429

A connection reset gives the client no signal, so it retries hard. Always return 429 + Retry-After.

2. Inconsistent limit communication

Headers say one thing; docs another; reality a third. Single source of truth.

3. No per-endpoint differentiation

A search query and a small read share the same budget. Heavy ops should consume more.

4. No customer visibility

Customers find out the hard way. Dashboards + headers.

5. Hard cliff with no backoff signal

A sudden 429 with no warning, then a spike of retries when the window resets at the minute boundary. Smooth via token bucket; signal the approaching limit early.

Implementation

  • Edge / API gateway: anon and IP limits.
  • App layer: per-user / per-key / per-endpoint.
  • Redis-backed token bucket with Lua for atomicity.
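The Redis-backed bucket in the last bullet could be sketched as follows. Assumptions: `client` is a redis-py style client exposing `eval(script, numkeys, *keys_and_args)`, the bucket lives in a hash with `tokens` and `ts` fields, and the app server’s clock is used for timestamps. The Lua script runs atomically inside Redis, which is what removes the read-modify-write race.

```python
import time

# Refill, spend, and persist in one atomic script.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * refill_per_sec)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Idle buckets expire once they would have refilled completely anyway.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill_per_sec) * 2)
return allowed
"""


def allow(client, key: str, capacity: int, refill_per_sec: float,
          cost: int = 1) -> bool:
    """Spend `cost` units from the bucket at `key`; True if the request may proceed."""
    result = client.eval(
        TOKEN_BUCKET_LUA, 1, key, capacity, refill_per_sec, cost, time.time()
    )
    return result == 1
```

With an asyncio client the same call would be awaited. Passing the timestamp from the app server keeps the script simple but assumes server clocks are reasonably in sync.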

See Rate Limiter Design.

What I’d ship today

For a public API:

  • Per-tier docs with concrete numbers.
  • RateLimit headers on every response.
  • 429 + Retry-After + JSON error body.
  • Per-endpoint cost weights.
  • Per-user + per-IP layers.
  • Concurrency caps for streaming / WebSocket.
  • Customer usage dashboard.
  • Self-service upgrade flow.
  • SDK that respects 429.

Read this next

If you want my customer-facing rate limit playbook + dashboard template, it’s at rajpoot.dev.

