Design a Distributed Key-Value Store
Design a Distributed Key-Value Store
A distributed key-value store is the backbone of countless large-scale systems. Redis, DynamoDB, Cassandra, Riak — they all share the same fundamental challenge: how do you spread data across dozens or hundreds of nodes, keep it replicated for fault-tolerance, and still let clients find any key in sub-millisecond time? This lesson dissects the two core mechanisms that make that possible: consistent hashing (where does a key live?) and replication with partitioning (how many copies exist and how are they distributed?).
Requirements Snapshot
- Functional:
put(key, value),get(key),delete(key). Keys and values are arbitrary byte strings. - Non-functional: read and write latency under 10 ms at p99; data survives the failure of any single node (or even a rack); horizontal write scalability to tens of nodes; tunable consistency (strong vs eventual).
- Scale: 1 TB of data today, 10 TB in two years; 50 000 read QPS + 5 000 write QPS at peak.
The Naive Approach — and Why It Breaks
The simplest partitioning strategy is modular hashing: node = hash(key) % N. With 4 nodes, key "user:42" might hash to 2 and always live on Node 2. This works until you add a fifth node. Now almost every key remaps to a different node. You must migrate ~80% of your data — an expensive, error-prone operation that leaves the cluster inconsistent for hours. This is not acceptable for a live system.
Consistent Hashing
Consistent hashing solves the remapping problem by placing both keys and nodes on the same circular hash ring (a number line from 0 to 232−1 that wraps around). To find a key's node, hash the key to a point on the ring, then walk clockwise until you hit the first node. This is called the key's coordinator node.
When you add a new node, it only inherits the keys that fall between it and its counter-clockwise neighbour. With 100 nodes, adding one node moves roughly 1% of keys — not 80%. Removing a node hands its keys to the next clockwise node.
Virtual Nodes — Handling Uneven Load
One problem with the basic ring is that nodes may end up with very different slice sizes if their hash positions cluster together. The solution is virtual nodes (vnodes): instead of placing each physical node once, place it at N random positions on the ring (DynamoDB uses 150–200 vnodes per physical node). Each physical node now owns many small, scattered arcs instead of one large arc, so load naturally balances even when physical nodes have different capacities. A more powerful server can be assigned more vnodes.
Replication and Partitioning
Consistent hashing tells you where the primary copy lives. But a single copy is not durable — if that node fails, the data is gone. The standard solution is to replicate each key to the next N clockwise nodes on the ring (N is usually 3). These are called the preference list.
For example, with N=3, the key "user:42" is stored on Node B (coordinator), Node C (first replica), and Node D (second replica). If Node B crashes, Node C automatically takes over reads and writes. Clients never notice — they just hit the preference list.
Quorum Reads and Writes
With N=3 replicas, you can tune consistency with two parameters: W (write quorum — how many nodes must acknowledge before the write returns) and R (read quorum — how many nodes must respond before the read returns). The rule is:
- W + R > N — strong consistency. At least one node in every read overlaps with every write. Cassandra and DynamoDB both support this with
QUORUMconsistency level. - W = 1, R = 1 — eventual consistency. Fastest path, lowest latency, but you might read a stale value in the milliseconds before replication completes.
- W = N, R = 1 — read-optimised. Every node must confirm the write (slow writes, fast reads).
Real-world choices: DynamoDB defaults to eventual consistency (faster, cheaper) but lets you request strongly consistent reads for an extra cost. Cassandra lets developers choose per-query.
Handling Node Failures — Hinted Handoff and Anti-Entropy
What happens when one of the N replica nodes is temporarily down during a write? The coordinator uses hinted handoff: it writes the data to a healthy node outside the preference list and attaches a "hint" — "deliver this to Node C once it recovers". When Node C comes back, the hinted node forwards the pending writes. This preserves write availability during transient failures without sacrificing durability.
For longer outages or detecting silent divergence (bit rot, bugs), the standard tool is Merkle trees. Each node maintains a hash tree over its data. Comparing two nodes means comparing their root hashes in O(1); if they differ, you traverse the tree to find only the diverged leaf ranges. Cassandra uses this for its anti-entropy repair process.
Putting It Together — Full Architecture
A production distributed key-value store combines all of the above:
- Consistent hashing ring with 150 vnodes per physical node to distribute partitions evenly.
- N=3 replication across different racks or availability zones so a single rack failure does not cause data loss.
- Tunable quorum (W, R configurable per request) to let high-throughput use-cases trade consistency for latency.
- Hinted handoff for transient node failures; Merkle tree anti-entropy for long-term divergence repair.
- Gossip protocol for membership — every node periodically shares its view of the ring with a random peer. No single point of failure for cluster topology.
- LSM-Tree storage engine (like RocksDB) for high write throughput — writes go to an in-memory memtable, then flush to sorted SSTables on disk. Background compaction merges SSTables. Read performance from Bloom filters that skip irrelevant files.