Real-World System Design Case Studies

Design a Rate Limiter

18 min Lesson 3 of 10

Design a Rate Limiter

A rate limiter controls how many requests a client can make to a service within a given time window. Without one, a single misbehaving client — or an attacker running a credential-stuffing script — can saturate your servers, degrade performance for everyone, and run up your cloud bill. At scale, rate limiting is not optional; it is a first-class infrastructure concern.

Real-world examples: GitHub enforces 5,000 API requests per hour per authenticated user. Stripe caps payment API calls at 100 requests per second per API key. Twilio rate-limits SMS sends per account to prevent spam. Each of these requires a distributed, high-performance limiter that adds no more than a few milliseconds of overhead.

Step 1 — Clarify Requirements

Before choosing an algorithm, nail down the constraints:

Granularity: limit per user? per IP? per API key? per endpoint?
Window semantics: fixed window (e.g. "60 requests per minute, resets at :00")? sliding window? token bucket (steady drip)?
Enforcement point: API gateway, service mesh, application middleware, or a shared library?
Scale: how many unique limiters? A service with 10 million users needs 10 million counters — that fits in ~80 MB of RAM (8 bytes each), easily cacheable.
Accuracy vs. performance: is it acceptable to allow a small overage, or must limits be exact?
Response on breach: hard reject with 429 Too Many Requests? Soft throttle? Queue?

Step 2 — Rate-Limiting Algorithms

There are four widely used algorithms. Each solves a different problem.

Fixed Window Counter

Divide time into fixed buckets (e.g. each minute). Increment a counter for each request; reject once the counter exceeds the limit. Reset at the boundary.

Pro: O(1) memory per client; trivially implemented with a single Redis INCR + EXPIRE.
Con: boundary burst. If the limit is 100 req/min and a client sends 100 at 00:59 and 100 more at 01:01, they get 200 requests in 2 seconds — double the intended rate.

Sliding Window Log

Store a sorted set (timestamp log) of every request. On each new request, remove entries older than the window, count the remaining entries, and allow or reject.

Pro: Perfectly accurate; no boundary burst.
Con: Memory proportional to request count per client. At 1,000 req/min per client, storing each timestamp costs ~50 bytes → 50 KB per heavy client. Redis sorted sets work well here, but memory can spike.

Sliding Window Counter (Hybrid)

Keep two fixed-window buckets: the current minute and the previous minute. Weight them by how far into the current window you are:

estimated_count =
    prev_bucket_count * (1 - elapsed_in_current_window / window_size)
    + current_bucket_count

This approximation is accurate to within ~0.1% for smooth traffic and costs only 2 counters per client. It is the algorithm used by Cloudflare for their 429-limit rules at trillion-request scale.

Token Bucket

Each client has a bucket holding up to capacity tokens. Tokens refill at a fixed rate (e.g. 10/sec). Each request consumes one token; if the bucket is empty the request is rejected. The bucket is defined by two numbers: capacity (burst ceiling) and refill rate (sustained rate).

Pro: Naturally allows short bursts up to capacity; smooth sustained rate. Amazon API Gateway uses this model.
Con: Requires storing last-refill timestamp and current token count. Concurrency control is needed when multiple nodes update the same bucket.

Rule of thumb: Use sliding window counter when you need accuracy with minimal memory. Use token bucket when you want to allow legitimate bursts (e.g. a mobile app that sends a flurry of messages when it reconnects).

The four main rate-limiting algorithms and their trade-offs.

Step 3 — Distributed Storage with Redis

A single-server rate limiter is straightforward. The hard part is making it work across a cluster of API gateway nodes so that counters are consistent regardless of which node handles the request.

Redis is the canonical solution. It is single-threaded (so operations are atomic by default) and supports the key primitives needed:

INCR key + EXPIRE key seconds — fixed window counter in two commands.
ZADD key timestamp member + ZREMRANGEBYSCORE + ZCARD — sliding window log.
Lua scripts — bundle multiple commands into a single atomic operation, eliminating race conditions between check and increment.

Use a Lua script for the sliding window counter. A Lua script runs atomically on the Redis server; no other client can interleave commands between the read and the write. This is far simpler than distributed locks and adds essentially zero latency.

-- Sliding window counter (Lua, runs atomically on Redis)
local key_curr = KEYS[1]          -- "rl:{user}:{minute}"
local key_prev = KEYS[2]          -- "rl:{user}:{prev_minute}"
local limit    = tonumber(ARGV[1])
local now_frac = tonumber(ARGV[2]) -- e.g. 0.75 = 45s into current minute

local prev = tonumber(redis.call('GET', key_prev) or 0)
local curr = tonumber(redis.call('GET', key_curr) or 0)
local estimate = prev * (1 - now_frac) + curr

if estimate >= limit then
  return 0  -- rejected
end
redis.call('INCR', key_curr)
redis.call('EXPIRE', key_curr, 120)
return 1  -- allowed

Step 4 — Distributed Architecture

At high scale the rate limiter itself must not become a bottleneck. The reference architecture places a thin rate-limiter sidecar (or API gateway rule) in front of every backend service. Each instance communicates with a small Redis cluster — not one global Redis, because a single Redis instance tops out at ~100–200k ops/sec.

Key design decisions:

Shard by client key: hash(user_id) % num_redis_nodes means each user's counter always lands on the same node — no cross-shard coordination needed.
Local cache for hot keys: A 10 ms in-process cache per gateway node prevents every request hitting Redis. The tradeoff is a brief window where a client can slightly exceed the limit across nodes before the cache expires. For most use cases this is acceptable.
Fallback policy: If Redis is unreachable, decide upfront: fail open (allow the request) or fail closed (reject). Fail open preserves availability; fail closed prevents abuse during outages. Most production systems fail open with an alert.
Return headers: always set X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset so clients can back off gracefully rather than hammering a 429 wall.

Distributed rate limiter: API gateway applies local cache + Lua check against a sharded Redis cluster.

Step 5 — Edge Cases and Production Concerns

Clock skew: When multiple nodes compute "which minute are we in?" using their local clocks, small clock differences (~1 s) can cause counters to land in different buckets. Synchronize clocks with NTP and use a window size of at least 10 seconds to make 1-second skew irrelevant.
Race condition on first request: Two nodes simultaneously doing GET → 0 → INCR can both allow a request that should be the last. A Lua script or Redis atomic SET NX eliminates this.
Multi-tier limiting: Apply limits at multiple scopes simultaneously — per IP (to block bots), per user (fair use), per API key (SLA tiers). Store each as a separate Redis key; check all three in one pipeline round-trip.
Distributed Lua caveat: In Redis Cluster, all keys in a Lua script must hash to the same slot. Use hash tags: rl:{user_id}:curr and rl:{user_id}:prev both contain {user_id}, so Redis routes them to the same node.

Avoid storing rate-limit state in your application database. At 50,000 req/sec, every rate-limit check is a write. Even a fast PostgreSQL will buckle. Redis handles this at sub-millisecond latency with no contention — that is precisely why it was built.

Summary

A production-grade distributed rate limiter combines the right algorithm (sliding window counter for accuracy with low memory, token bucket for burst tolerance) with Redis for atomic, low-latency counter storage, Lua scripts for race-free updates, key sharding for horizontal scale, and a clear fallback policy when storage is unavailable. The result is a component that adds fewer than 2 ms to each request while protecting your backend from tens of thousands of abusive clients per second.