Real-World System Design Case Studies

Design a Distributed Key-Value Store

18 min Lesson 9 of 10

Design a Distributed Key-Value Store

A distributed key-value store is the backbone of countless large-scale systems. Redis, DynamoDB, Cassandra, Riak — they all share the same fundamental challenge: how do you spread data across dozens or hundreds of nodes, keep it replicated for fault-tolerance, and still let clients find any key in sub-millisecond time? This lesson dissects the two core mechanisms that make that possible: consistent hashing (where does a key live?) and replication with partitioning (how many copies exist and how are they distributed?).

Requirements Snapshot

  • Functional: put(key, value), get(key), delete(key). Keys and values are arbitrary byte strings.
  • Non-functional: read and write latency under 10 ms at p99; data survives the failure of any single node (or even a rack); horizontal write scalability to tens of nodes; tunable consistency (strong vs eventual).
  • Scale: 1 TB of data today, 10 TB in two years; 50 000 read QPS + 5 000 write QPS at peak.
Key idea: A key-value store is deliberately simple at the API level — three operations. All the complexity lives in the distribution layer: deciding which node owns a key, replicating it for durability, and resolving conflicts when nodes disagree.

The Naive Approach — and Why It Breaks

The simplest partitioning strategy is modular hashing: node = hash(key) % N. With 4 nodes, key "user:42" might hash to 2 and always live on Node 2. This works until you add a fifth node. Now almost every key remaps to a different node. You must migrate ~80% of your data — an expensive, error-prone operation that leaves the cluster inconsistent for hours. This is not acceptable for a live system.

Consistent Hashing

Consistent hashing solves the remapping problem by placing both keys and nodes on the same circular hash ring (a number line from 0 to 232−1 that wraps around). To find a key's node, hash the key to a point on the ring, then walk clockwise until you hit the first node. This is called the key's coordinator node.

When you add a new node, it only inherits the keys that fall between it and its counter-clockwise neighbour. With 100 nodes, adding one node moves roughly 1% of keys — not 80%. Removing a node hands its keys to the next clockwise node.

Consistent hashing ring with virtual nodes Node A pos 0 Node B pos 90 Node C pos 180 Node D pos 270 user:42 session:9 A' vnode Legend Physical node Key position Virtual node clockwise lookup
Consistent hashing ring: keys and nodes share the same circular space. A key routes to the first node encountered clockwise. Virtual nodes (A') spread load more evenly.

Virtual Nodes — Handling Uneven Load

One problem with the basic ring is that nodes may end up with very different slice sizes if their hash positions cluster together. The solution is virtual nodes (vnodes): instead of placing each physical node once, place it at N random positions on the ring (DynamoDB uses 150–200 vnodes per physical node). Each physical node now owns many small, scattered arcs instead of one large arc, so load naturally balances even when physical nodes have different capacities. A more powerful server can be assigned more vnodes.

Best practice: Use 100–200 virtual nodes per physical server. The overhead is minimal (a sorted array of token-to-node mappings) but the load balance improvement is dramatic, especially with fewer than 20 physical nodes.

Replication and Partitioning

Consistent hashing tells you where the primary copy lives. But a single copy is not durable — if that node fails, the data is gone. The standard solution is to replicate each key to the next N clockwise nodes on the ring (N is usually 3). These are called the preference list.

For example, with N=3, the key "user:42" is stored on Node B (coordinator), Node C (first replica), and Node D (second replica). If Node B crashes, Node C automatically takes over reads and writes. Clients never notice — they just hit the preference list.

Replication with N=3 preference list across partitions Client put(key, val) Node B Coordinator Primary copy Node C Replica 1 async write Node D Replica 2 async write Quorum (N=3) W=2 (strong write) R=2 (strong read) W+R > N = consistent W=1, R=1 = eventual replicate replicate ack W=2
Replication with N=3: the coordinator writes to itself and replicates to 2 more nodes. Quorum (W+R > N) determines whether reads and writes are strongly consistent or eventually consistent.

Quorum Reads and Writes

With N=3 replicas, you can tune consistency with two parameters: W (write quorum — how many nodes must acknowledge before the write returns) and R (read quorum — how many nodes must respond before the read returns). The rule is:

  • W + R > N — strong consistency. At least one node in every read overlaps with every write. Cassandra and DynamoDB both support this with QUORUM consistency level.
  • W = 1, R = 1 — eventual consistency. Fastest path, lowest latency, but you might read a stale value in the milliseconds before replication completes.
  • W = N, R = 1 — read-optimised. Every node must confirm the write (slow writes, fast reads).

Real-world choices: DynamoDB defaults to eventual consistency (faster, cheaper) but lets you request strongly consistent reads for an extra cost. Cassandra lets developers choose per-query.

Handling Node Failures — Hinted Handoff and Anti-Entropy

What happens when one of the N replica nodes is temporarily down during a write? The coordinator uses hinted handoff: it writes the data to a healthy node outside the preference list and attaches a "hint" — "deliver this to Node C once it recovers". When Node C comes back, the hinted node forwards the pending writes. This preserves write availability during transient failures without sacrificing durability.

For longer outages or detecting silent divergence (bit rot, bugs), the standard tool is Merkle trees. Each node maintains a hash tree over its data. Comparing two nodes means comparing their root hashes in O(1); if they differ, you traverse the tree to find only the diverged leaf ranges. Cassandra uses this for its anti-entropy repair process.

Pitfall — write conflicts: With eventual consistency, two clients can write different values to the same key on different replicas during a network partition. When the partition heals, you have a conflict. You must choose a resolution strategy: Last Write Wins (LWW — simple but can lose data), vector clocks (correct but complex), or CRDT structures (merge-friendly data types). DynamoDB uses LWW by default; Riak supports vector clocks.

Putting It Together — Full Architecture

A production distributed key-value store combines all of the above:

  1. Consistent hashing ring with 150 vnodes per physical node to distribute partitions evenly.
  2. N=3 replication across different racks or availability zones so a single rack failure does not cause data loss.
  3. Tunable quorum (W, R configurable per request) to let high-throughput use-cases trade consistency for latency.
  4. Hinted handoff for transient node failures; Merkle tree anti-entropy for long-term divergence repair.
  5. Gossip protocol for membership — every node periodically shares its view of the ring with a random peer. No single point of failure for cluster topology.
  6. LSM-Tree storage engine (like RocksDB) for high write throughput — writes go to an in-memory memtable, then flush to sorted SSTables on disk. Background compaction merges SSTables. Read performance from Bloom filters that skip irrelevant files.
Key idea: The CAP theorem says a distributed system can guarantee at most two of: Consistency, Availability, Partition-tolerance. During a network partition you must choose: reject writes (CP — consistent but unavailable) or accept writes on both sides (AP — available but potentially inconsistent). DynamoDB and Cassandra choose AP. Zookeeper chooses CP. Know which one your use-case needs before you design.