Design a Rate Limiter
Design a Rate Limiter
A rate limiter controls how many requests a client can make to a service within a given time window. Without one, a single misbehaving client — or an attacker running a credential-stuffing script — can saturate your servers, degrade performance for everyone, and run up your cloud bill. At scale, rate limiting is not optional; it is a first-class infrastructure concern.
Real-world examples: GitHub enforces 5,000 API requests per hour per authenticated user. Stripe caps payment API calls at 100 requests per second per API key. Twilio rate-limits SMS sends per account to prevent spam. Each of these requires a distributed, high-performance limiter that adds no more than a few milliseconds of overhead.
Step 1 — Clarify Requirements
Before choosing an algorithm, nail down the constraints:
- Granularity: limit per user? per IP? per API key? per endpoint?
- Window semantics: fixed window (e.g. "60 requests per minute, resets at :00")? sliding window? token bucket (steady drip)?
- Enforcement point: API gateway, service mesh, application middleware, or a shared library?
- Scale: how many unique limiters? A service with 10 million users needs 10 million counters — that fits in ~80 MB of RAM (8 bytes each), easily cacheable.
- Accuracy vs. performance: is it acceptable to allow a small overage, or must limits be exact?
- Response on breach: hard reject with
429 Too Many Requests? Soft throttle? Queue?
Step 2 — Rate-Limiting Algorithms
There are four widely used algorithms. Each solves a different problem.
Fixed Window Counter
Divide time into fixed buckets (e.g. each minute). Increment a counter for each request; reject once the counter exceeds the limit. Reset at the boundary.
- Pro: O(1) memory per client; trivially implemented with a single Redis
INCR+EXPIRE. - Con: boundary burst. If the limit is 100 req/min and a client sends 100 at 00:59 and 100 more at 01:01, they get 200 requests in 2 seconds — double the intended rate.
Sliding Window Log
Store a sorted set (timestamp log) of every request. On each new request, remove entries older than the window, count the remaining entries, and allow or reject.
- Pro: Perfectly accurate; no boundary burst.
- Con: Memory proportional to request count per client. At 1,000 req/min per client, storing each timestamp costs ~50 bytes → 50 KB per heavy client. Redis sorted sets work well here, but memory can spike.
Sliding Window Counter (Hybrid)
Keep two fixed-window buckets: the current minute and the previous minute. Weight them by how far into the current window you are:
This approximation is accurate to within ~0.1% for smooth traffic and costs only 2 counters per client. It is the algorithm used by Cloudflare for their 429-limit rules at trillion-request scale.
Token Bucket
Each client has a bucket holding up to capacity tokens. Tokens refill at a fixed rate (e.g. 10/sec). Each request consumes one token; if the bucket is empty the request is rejected. The bucket is defined by two numbers: capacity (burst ceiling) and refill rate (sustained rate).
- Pro: Naturally allows short bursts up to
capacity; smooth sustained rate. Amazon API Gateway uses this model. - Con: Requires storing last-refill timestamp and current token count. Concurrency control is needed when multiple nodes update the same bucket.
Step 3 — Distributed Storage with Redis
A single-server rate limiter is straightforward. The hard part is making it work across a cluster of API gateway nodes so that counters are consistent regardless of which node handles the request.
Redis is the canonical solution. It is single-threaded (so operations are atomic by default) and supports the key primitives needed:
INCR key+EXPIRE key seconds— fixed window counter in two commands.ZADD key timestamp member+ZREMRANGEBYSCORE+ZCARD— sliding window log.- Lua scripts — bundle multiple commands into a single atomic operation, eliminating race conditions between check and increment.
Step 4 — Distributed Architecture
At high scale the rate limiter itself must not become a bottleneck. The reference architecture places a thin rate-limiter sidecar (or API gateway rule) in front of every backend service. Each instance communicates with a small Redis cluster — not one global Redis, because a single Redis instance tops out at ~100–200k ops/sec.
Key design decisions:
- Shard by client key:
hash(user_id) % num_redis_nodesmeans each user's counter always lands on the same node — no cross-shard coordination needed. - Local cache for hot keys: A 10 ms in-process cache per gateway node prevents every request hitting Redis. The tradeoff is a brief window where a client can slightly exceed the limit across nodes before the cache expires. For most use cases this is acceptable.
- Fallback policy: If Redis is unreachable, decide upfront: fail open (allow the request) or fail closed (reject). Fail open preserves availability; fail closed prevents abuse during outages. Most production systems fail open with an alert.
- Return headers: always set
X-RateLimit-Limit,X-RateLimit-Remaining, andX-RateLimit-Resetso clients can back off gracefully rather than hammering a 429 wall.
Step 5 — Edge Cases and Production Concerns
- Clock skew: When multiple nodes compute "which minute are we in?" using their local clocks, small clock differences (~1 s) can cause counters to land in different buckets. Synchronize clocks with NTP and use a window size of at least 10 seconds to make 1-second skew irrelevant.
- Race condition on first request: Two nodes simultaneously doing GET → 0 → INCR can both allow a request that should be the last. A Lua script or Redis atomic
SET NXeliminates this. - Multi-tier limiting: Apply limits at multiple scopes simultaneously — per IP (to block bots), per user (fair use), per API key (SLA tiers). Store each as a separate Redis key; check all three in one pipeline round-trip.
- Distributed Lua caveat: In Redis Cluster, all keys in a Lua script must hash to the same slot. Use hash tags:
rl:{user_id}:currandrl:{user_id}:prevboth contain{user_id}, so Redis routes them to the same node.
Summary
A production-grade distributed rate limiter combines the right algorithm (sliding window counter for accuracy with low memory, token bucket for burst tolerance) with Redis for atomic, low-latency counter storage, Lua scripts for race-free updates, key sharding for horizontal scale, and a clear fallback policy when storage is unavailable. The result is a component that adds fewer than 2 ms to each request while protecting your backend from tens of thousands of abusive clients per second.