Real-World System Design Case Studies

Design a Rate Limiter

18 min Lesson 3 of 10

Design a Rate Limiter

A rate limiter controls how many requests a client can make to a service within a given time window. Without one, a single misbehaving client — or an attacker running a credential-stuffing script — can saturate your servers, degrade performance for everyone, and run up your cloud bill. At scale, rate limiting is not optional; it is a first-class infrastructure concern.

Real-world examples: GitHub enforces 5,000 API requests per hour per authenticated user. Stripe caps payment API calls at 100 requests per second per API key. Twilio rate-limits SMS sends per account to prevent spam. Each of these requires a distributed, high-performance limiter that adds no more than a few milliseconds of overhead.

Step 1 — Clarify Requirements

Before choosing an algorithm, nail down the constraints:

  • Granularity: limit per user? per IP? per API key? per endpoint?
  • Window semantics: fixed window (e.g. "60 requests per minute, resets at :00")? sliding window? token bucket (steady drip)?
  • Enforcement point: API gateway, service mesh, application middleware, or a shared library?
  • Scale: how many unique limiters? A service with 10 million users needs 10 million counters — that fits in ~80 MB of RAM (8 bytes each), easily cacheable.
  • Accuracy vs. performance: is it acceptable to allow a small overage, or must limits be exact?
  • Response on breach: hard reject with 429 Too Many Requests? Soft throttle? Queue?

Step 2 — Rate-Limiting Algorithms

There are four widely used algorithms. Each solves a different problem.

Fixed Window Counter

Divide time into fixed buckets (e.g. each minute). Increment a counter for each request; reject once the counter exceeds the limit. Reset at the boundary.

  • Pro: O(1) memory per client; trivially implemented with a single Redis INCR + EXPIRE.
  • Con: boundary burst. If the limit is 100 req/min and a client sends 100 at 00:59 and 100 more at 01:01, they get 200 requests in 2 seconds — double the intended rate.

Sliding Window Log

Store a sorted set (timestamp log) of every request. On each new request, remove entries older than the window, count the remaining entries, and allow or reject.

  • Pro: Perfectly accurate; no boundary burst.
  • Con: Memory proportional to request count per client. At 1,000 req/min per client, storing each timestamp costs ~50 bytes → 50 KB per heavy client. Redis sorted sets work well here, but memory can spike.

Sliding Window Counter (Hybrid)

Keep two fixed-window buckets: the current minute and the previous minute. Weight them by how far into the current window you are:

estimated_count = prev_bucket_count * (1 - elapsed_in_current_window / window_size) + current_bucket_count

This approximation is accurate to within ~0.1% for smooth traffic and costs only 2 counters per client. It is the algorithm used by Cloudflare for their 429-limit rules at trillion-request scale.

Token Bucket

Each client has a bucket holding up to capacity tokens. Tokens refill at a fixed rate (e.g. 10/sec). Each request consumes one token; if the bucket is empty the request is rejected. The bucket is defined by two numbers: capacity (burst ceiling) and refill rate (sustained rate).

  • Pro: Naturally allows short bursts up to capacity; smooth sustained rate. Amazon API Gateway uses this model.
  • Con: Requires storing last-refill timestamp and current token count. Concurrency control is needed when multiple nodes update the same bucket.
Rule of thumb: Use sliding window counter when you need accuracy with minimal memory. Use token bucket when you want to allow legitimate bursts (e.g. a mobile app that sends a flurry of messages when it reconnects).
Four rate-limiting algorithms compared Fixed Window :00–:59 cnt=98 01:00– cnt=0 ⚠ boundary burst Sliding Log Sorted Set (timestamps) t-59s … t-2s … t-0s count = 97 → allow ✓ exact accuracy Sliding Counter prev cnt=60 curr cnt=30 est = 60×0.4 + 30 = 54 ✓ low memory Token Bucket cap=100 refill 10/sec ✓ burst-friendly Algorithm Comparison Algorithm Memory Accuracy Burst handling Fixed Window O(1) Boundary burst Poor Sliding Log O(n) per client Exact N/A Sliding Counter O(1) ~99.9% Fair Token Bucket O(1) Exact (sustained) Excellent
The four main rate-limiting algorithms and their trade-offs.

Step 3 — Distributed Storage with Redis

A single-server rate limiter is straightforward. The hard part is making it work across a cluster of API gateway nodes so that counters are consistent regardless of which node handles the request.

Redis is the canonical solution. It is single-threaded (so operations are atomic by default) and supports the key primitives needed:

  • INCR key + EXPIRE key seconds — fixed window counter in two commands.
  • ZADD key timestamp member + ZREMRANGEBYSCORE + ZCARD — sliding window log.
  • Lua scripts — bundle multiple commands into a single atomic operation, eliminating race conditions between check and increment.
Use a Lua script for the sliding window counter. A Lua script runs atomically on the Redis server; no other client can interleave commands between the read and the write. This is far simpler than distributed locks and adds essentially zero latency.
-- Sliding window counter (Lua, runs atomically on Redis) local key_curr = KEYS[1] -- "rl:{user}:{minute}" local key_prev = KEYS[2] -- "rl:{user}:{prev_minute}" local limit = tonumber(ARGV[1]) local now_frac = tonumber(ARGV[2]) -- e.g. 0.75 = 45s into current minute local prev = tonumber(redis.call('GET', key_prev) or 0) local curr = tonumber(redis.call('GET', key_curr) or 0) local estimate = prev * (1 - now_frac) + curr if estimate >= limit then return 0 -- rejected end redis.call('INCR', key_curr) redis.call('EXPIRE', key_curr, 120) return 1 -- allowed

Step 4 — Distributed Architecture

At high scale the rate limiter itself must not become a bottleneck. The reference architecture places a thin rate-limiter sidecar (or API gateway rule) in front of every backend service. Each instance communicates with a small Redis cluster — not one global Redis, because a single Redis instance tops out at ~100–200k ops/sec.

Key design decisions:

  • Shard by client key: hash(user_id) % num_redis_nodes means each user's counter always lands on the same node — no cross-shard coordination needed.
  • Local cache for hot keys: A 10 ms in-process cache per gateway node prevents every request hitting Redis. The tradeoff is a brief window where a client can slightly exceed the limit across nodes before the cache expires. For most use cases this is acceptable.
  • Fallback policy: If Redis is unreachable, decide upfront: fail open (allow the request) or fail closed (reject). Fail open preserves availability; fail closed prevents abuse during outages. Most production systems fail open with an alert.
  • Return headers: always set X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset so clients can back off gracefully rather than hammering a 429 wall.
Distributed rate limiter architecture with Redis sharding Client A Client B API Gateway + Rate Limiter Local Cache (10ms TTL) Lua Script (atomic check) Headers X-RateLimit-* INCR / Lua Redis Cluster (sharded by key) Node 0 keys A–H Node 1 keys I–P Node 2 keys Q–Z Node 3 (replica) allow / deny Backend Services Allowed only 429 if denied Fail open if Redis unreachable
Distributed rate limiter: API gateway applies local cache + Lua check against a sharded Redis cluster.

Step 5 — Edge Cases and Production Concerns

  • Clock skew: When multiple nodes compute "which minute are we in?" using their local clocks, small clock differences (~1 s) can cause counters to land in different buckets. Synchronize clocks with NTP and use a window size of at least 10 seconds to make 1-second skew irrelevant.
  • Race condition on first request: Two nodes simultaneously doing GET → 0 → INCR can both allow a request that should be the last. A Lua script or Redis atomic SET NX eliminates this.
  • Multi-tier limiting: Apply limits at multiple scopes simultaneously — per IP (to block bots), per user (fair use), per API key (SLA tiers). Store each as a separate Redis key; check all three in one pipeline round-trip.
  • Distributed Lua caveat: In Redis Cluster, all keys in a Lua script must hash to the same slot. Use hash tags: rl:{user_id}:curr and rl:{user_id}:prev both contain {user_id}, so Redis routes them to the same node.
Avoid storing rate-limit state in your application database. At 50,000 req/sec, every rate-limit check is a write. Even a fast PostgreSQL will buckle. Redis handles this at sub-millisecond latency with no contention — that is precisely why it was built.

Summary

A production-grade distributed rate limiter combines the right algorithm (sliding window counter for accuracy with low memory, token bucket for burst tolerance) with Redis for atomic, low-latency counter storage, Lua scripts for race-free updates, key sharding for horizontal scale, and a clear fallback policy when storage is unavailable. The result is a component that adds fewer than 2 ms to each request while protecting your backend from tens of thousands of abusive clients per second.