Rate limits are a customer-facing API. Done well, customers integrate happily and rarely hit them. Done poorly, you generate support tickets and lose trust. This post is the working playbook.
Tier matrix
Free: 60 req/min, 1k req/day, no concurrency
Pro: 1000 req/min, 100k req/day, 10 concurrent
Enterprise: Custom (negotiated per contract)
Document this. Customers want to know before they integrate, not after they hit a limit.
Per-endpoint limits
Some endpoints cost more:
/v1/users/{id}          cheap    1 unit
/v1/search              medium   5 units
/v1/reports/generate    heavy    50 units
Use a token bucket with weighted costs. Heavy endpoints consume more budget per call, so customers can't hammer the expensive ones and eat the capacity meant for baseline traffic.
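A minimal in-memory sketch of a weighted token bucket, using the unit costs listed above. The names TokenBucket, ENDPOINT_COST, and allow_request are illustrative, and the injectable clock is only there to make the behavior easy to test; a production version would keep this state in shared storage.

```python
import time
from typing import Callable

# Unit costs per route; unknown routes default to 1 unit.
ENDPOINT_COST = {
    "/v1/users/{id}": 1,
    "/v1/search": 5,
    "/v1/reports/generate": 50,
}


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity              # maximum units the bucket holds
        self.refill_per_sec = refill_per_sec  # units restored per second
        self.clock = clock
        self.tokens = capacity                # start full
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` units if available."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True


def allow_request(bucket: TokenBucket, route: str) -> bool:
    """Charge the bucket with the route's weight."""
    return bucket.allow(ENDPOINT_COST.get(route, 1))
```

With a 60-unit bucket, one report generation leaves 10 units: enough for two searches or ten user reads, but not another report until the bucket refills.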
The headers
Standard:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1714650000
RateLimit: limit=1000, remaining=847, reset=60
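The single RateLimit header comes from an IETF draft (not yet a final RFC) and is the form moving toward standardization. X-RateLimit-* is the legacy de facto convention and still expected by many clients, so send both. Note the difference in reset: the draft header carries seconds until the window resets, while X-RateLimit-Reset here is a Unix timestamp.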
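One way to keep the two header families from drifting apart is to build them from the same numbers. This is a sketch; rate_limit_headers is a hypothetical helper, and the RateLimit value simply mirrors the syntax of the example above rather than any particular draft revision.

```python
from typing import Dict


def rate_limit_headers(limit: int, remaining: int,
                       reset_epoch: int, now_epoch: int) -> Dict[str, str]:
    """Build legacy and draft-style headers from one set of values."""
    remaining = max(0, remaining)                 # never advertise a negative budget
    reset_in = max(0, reset_epoch - now_epoch)    # seconds until the window resets
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_epoch),    # Unix timestamp
        "RateLimit": f"limit={limit}, remaining={remaining}, reset={reset_in}",
    }
```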
429 responses
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "60 requests per minute exceeded for free tier",
  "retry_after_seconds": 30,
  "documentation_url": "https://docs.example.com/rate-limits"
}
Retry-After in seconds. Machine-readable. Doc link for humans. Specific message about which limit.
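A small sketch of producing that response so the header and the body can never disagree. too_many_requests is a hypothetical helper that returns a (status, headers, body) tuple; adapt it to whatever response type your framework uses.

```python
import json
import math
from typing import Dict, Tuple


def too_many_requests(limit: int, window: str, tier: str,
                      retry_after: float) -> Tuple[int, Dict[str, str], str]:
    """Build a 429 whose Retry-After header and JSON body share one value."""
    # Round up to whole seconds so clients never retry too early.
    seconds = max(1, math.ceil(retry_after))
    body = {
        "error": "rate_limit_exceeded",
        "message": f"{limit} requests per {window} exceeded for {tier} tier",
        "retry_after_seconds": seconds,
        "documentation_url": "https://docs.example.com/rate-limits",
    }
    headers = {"Retry-After": str(seconds), "Content-Type": "application/json"}
    return 429, headers, json.dumps(body)
```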
Per-user vs per-IP
async def check(req):
    if user := req.user:
        if not await allow(f"user:{user.id}", user.tier_limits):
            raise TooManyRequests
    else:
        if not await allow(f"ip:{req.ip}", anon_limits):
            raise TooManyRequests
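Authenticated → per-user (per-key in B2B). Anonymous → per-IP. They compose: the snippet picks one or the other, but you can also keep a looser per-IP check in front of the per-user one.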
Cost endpoints
For variable-cost ops (LLM calls, large queries):
async def check_cost(user_id, estimated_cost):
    if not await allow(f"user:{user_id}", cost=estimated_cost):
        raise TooManyRequests
Estimate before the call and reserve that amount; charge the actual cost after. Adjust the budget by the difference once the real cost is known.
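The estimate-then-settle flow can be sketched with a simple per-window budget. CostBudget, reserve, and settle are illustrative names, and window reset is left out to keep the example short.

```python
class CostBudget:
    """Per-user budget for one window (window reset omitted for brevity)."""

    def __init__(self, limit: float) -> None:
        self.limit = limit
        self.used = 0.0

    def reserve(self, estimated: float) -> bool:
        """Before the call: hold the estimated cost, or refuse if it won't fit."""
        if self.used + estimated > self.limit:
            return False
        self.used += estimated
        return True

    def settle(self, estimated: float, actual: float) -> None:
        """After the call: swap the estimate for the real cost.

        An underestimate can push usage past the limit; that overage is then
        paid for by later requests being refused.
        """
        self.used = max(0.0, self.used + (actual - estimated))
```

Settling can refund as well as charge: a call estimated at 30 units that actually cost 10 gives 20 units back to the budget.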
Concurrency limits
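Distinct from rate limits (which are per unit of time), a concurrency limit caps requests in flight:
async def check_concurrency(user_id, tier_limit):
    sem_key = f"concur:{user_id}"
    current = await redis.incr(sem_key)
    # Safety net: a leaked count expires. Use a TTL longer than your longest request.
    await redis.expire(sem_key, 60)
    if current > tier_limit:
        await redis.decr(sem_key)  # give the slot back before rejecting
        raise TooManyConcurrent
    return sem_key

# After: release the slot even if the handler raises.
async def with_concurrency_slot(user_id, tier_limit, handler):
    sem_key = await check_concurrency(user_id, tier_limit)
    try:
        return await handler()
    finally:
        await redis.decr(sem_key)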
Prevents a single user from holding 1000 concurrent connections.
Fairness
For multi-tenant: one heavy user shouldn’t starve others. Layer:
- Global cap on the system.
- Per-tier caps on aggregate.
- Per-user caps within tier.
- Per-endpoint caps for expensive ops.
Tighter limits inside; looser outside. Multi-layer fairness.
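A rough sketch of checking those layers together. Counter and check_layers are hypothetical names, and the two-pass check below is only safe in a single process; against Redis, the same check-then-charge would need to run atomically (for example in one Lua script).

```python
from typing import List, Tuple


class Counter:
    """A fixed budget for one layer (window reset omitted for brevity)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0


def check_layers(layers: List[Tuple[str, Counter]], cost: int = 1) -> Tuple[bool, str]:
    """Return (allowed, name_of_blocking_layer)."""
    # First pass: look for a layer without room, spending nothing yet,
    # so a rejected request never consumes budget in the outer layers.
    for name, counter in layers:
        if counter.used + cost > counter.limit:
            return False, name
    # Second pass: every layer has room, so charge them all.
    for _, counter in layers:
        counter.used += cost
    return True, ""
```

Returning the name of the blocking layer makes it possible to tell the customer which limit they hit, which is what the 429 message should say.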
Stripe-style patterns
- Per-secret-key: each integration has its own budget.
- Idempotency keys combined with rate limits: same key in window doesn’t count again.
- Stripe-Account header: per-connected-account limits for platform integrations.
Maps well to platform / partner APIs.
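The idempotency-key idea can be sketched as a counter that remembers which keys it has already charged in the current window. This is an illustration of the pattern as described here, not Stripe's implementation; DedupingCounter is a hypothetical name, and a real system would still want a separate cap on replays so one key can't be hammered for free.

```python
from typing import Optional, Set


class DedupingCounter:
    """Request counter that charges each idempotency key once per window."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0
        self.seen_keys: Set[str] = set()

    def allow(self, idempotency_key: Optional[str] = None) -> bool:
        # A retry of a request already counted in this window is not charged again.
        if idempotency_key is not None and idempotency_key in self.seen_keys:
            return True
        if self.used >= self.limit:
            return False
        self.used += 1
        if idempotency_key is not None:
            self.seen_keys.add(idempotency_key)
        return True

    def reset_window(self) -> None:
        """Call at each window boundary."""
        self.used = 0
        self.seen_keys.clear()
```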
GitHub-style patterns
- Per-token + IP combined.
- Cost per query (GraphQL): node count weighting.
- Secondary rate limits for abuse patterns (rapid creation, search).
For GraphQL APIs: weight by query depth + breadth. Cheap queries use little; expensive queries use more.
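A simplified sketch of depth-and-breadth weighting. It assumes the query has already been parsed into nested dicts with a hypothetical shape ({"field", "first", "children"}); it is not GitHub's actual formula, only the general idea of multiplying page sizes down through nested connections.

```python
from typing import Any, Dict, List


def query_cost(selection: Dict[str, Any], parents: int = 1) -> int:
    """Estimate nodes touched: each level's page size times its parent count."""
    # A field without `first` is treated as a single node per parent.
    nodes = parents * selection.get("first", 1)
    children: List[Dict[str, Any]] = selection.get("children", [])
    return nodes + sum(query_cost(child, nodes) for child in children)


# 10 repositories, each with 20 issues: 10 + 10 * 20 nodes.
EXAMPLE_QUERY = {
    "field": "repositories",
    "first": 10,
    "children": [{"field": "issues", "first": 20}],
}
```

Here query_cost(EXAMPLE_QUERY) comes to 210, while a flat {"field": "viewer"} query costs 1. Charge that number against the same weighted budget used for REST endpoints.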
SDK behavior
A good SDK respects 429 + Retry-After:
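const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function call(req) {
  for (let i = 0; i < 5; i++) {
    const resp = await fetch(req);
    if (resp.status === 429) {
      // Retry-After may be missing or an HTTP date; fall back to 10 seconds.
      const parsed = parseInt(resp.headers.get("retry-after") ?? "", 10);
      const wait = Number.isNaN(parsed) ? 10 : parsed;
      // Up to 0.5 s of jitter so clients don't all retry in lockstep.
      await sleep((wait + Math.random() * 0.5) * 1000);
      continue;
    }
    return resp;
  }
  throw new Error("rate limit exhausted");
}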
Document this behavior; offer SDKs that handle it.
Soft limits before hard
At 80% budget consumed: response includes warning header.
At 100%: 429.
Gives customers a chance to back off before being blocked.
RateLimit-Warning: approaching limit; 200 of 1000 remaining
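A sketch of emitting that warning. RateLimit-Warning is a custom header, not part of any standard, so document it; warning_headers and WARN_PERCENT are illustrative names. Integer arithmetic avoids floating-point surprises right at the threshold.

```python
from typing import Dict

WARN_PERCENT = 80  # warn once this share of the budget is consumed


def warning_headers(limit: int, remaining: int) -> Dict[str, str]:
    """Return the soft-limit warning header, or nothing below the threshold."""
    used = limit - remaining
    # At remaining == 0 the request gets a 429 instead of a warning.
    if remaining > 0 and used * 100 >= limit * WARN_PERCENT:
        return {
            "RateLimit-Warning": f"approaching limit; {remaining} of {limit} remaining"
        }
    return {}
```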
Customer dashboards
Show:
- Current minute / hour / day usage.
- Per-endpoint breakdown.
- Recent 429s.
- Quota and tier.
Customers integrate better when they can see usage. Reduces support load too.
Rate limit increases
Process for “I need more”:
- Self-service for free → pro upgrade.
- Form for enterprise with usage justification.
- Engineering review for bulk requests; protect against abuse.
Make it easy for legit customers; gate the rest.
Common mistakes
1. Drop without 429
Connection reset; client retries hard. Always 429 + Retry-After.
2. Inconsistent limit communication
Headers say one thing; docs another; reality a third. Single source of truth.
3. No per-endpoint differentiation
A search query and a small read share the same budget. Heavy ops should consume more.
4. No customer visibility
Customers find out the hard way. Dashboards + headers.
5. Hard cliff with no backoff signal
Suddenly 429 at minute boundary; spike of retries. Smooth via token bucket; signal approaching limit early.
Implementation
- Edge / API gateway: anon and IP limits.
- App layer: per-user / per-key / per-endpoint.
- Redis-backed token bucket with Lua for atomicity.
See Rate Limiter Design.
What I’d ship today
For a public API:
- Per-tier docs with concrete numbers.
- RateLimit headers on every response.
- 429 + Retry-After + JSON error body.
- Per-endpoint cost weights.
- Per-user + per-IP layers.
- Concurrency caps for streaming / WebSocket.
- Customer usage dashboard.
- Self-service upgrade flow.
- SDK that respects 429.
Read this next
- Design a Rate Limiter 2026
- API Versioning 2026
- API Rate Limit Design 2026
- Idempotency, Retries, and Exactly-Once Illusions
If you want my customer-facing rate limit playbook + dashboard template, it’s at rajpoot.dev.