Multi-Leader & Leaderless Replication
Multi-Leader & Leaderless Replication
Single-leader replication is simple to reason about, but it creates a bottleneck: every write must travel through one node. If that leader is in a data center on the other side of the world from your users, or if it simply cannot keep up with write volume, you hit a ceiling. Two architectural patterns break that ceiling in very different ways: multi-leader replication and leaderless replication (the Dynamo style). Both trade away some simplicity in exchange for massive gains in write throughput, geographic distribution, and fault tolerance.
Multi-Leader Replication
In a multi-leader setup, more than one node is allowed to accept writes. Each leader asynchronously replicates its writes to every other leader (and optionally to followers). The most common deployment is one leader per data center: writes from users in Europe go to the EU leader, writes from users in the US go to the US leader, and both leaders synchronize with each other in the background.
Real systems that use this pattern include CockroachDB (multi-region), MySQL Group Replication, PostgreSQL BDR (Bi-Directional Replication), and Google Docs (operational-transform style multi-master for collaborative editing).
Advantages:
- Local write latency — each region writes to its nearby leader without a cross-continental round trip.
- Continues accepting writes even if one data center goes offline.
- Write throughput scales horizontally: two leaders roughly double write capacity.
The price: write conflicts. If two leaders accept different writes to the same record at the same time, you have a conflict that must be resolved. This is the central challenge of multi-leader systems.
Conflict Resolution Strategies
There is no single right answer; you choose based on semantics:
- Last-Write-Wins (LWW): Attach a timestamp to every write; the higher timestamp wins. Simple but dangerous — clocks on distributed machines drift, so you can silently discard the "real" update. Used by Cassandra by default.
- Custom merge logic: The application defines what "merge" means. A shopping cart might union both sets of items. A counter might add the deltas. CRDTs (Conflict-free Replicated Data Types) formalize this into data structures where any merge always produces the same result regardless of order.
- On-write vs. on-read: Detect the conflict at write time and reject or queue it, or store all conflicting versions and resolve lazily when the data is next read (Amazon's shopping-cart model).
Leaderless Replication — The Dynamo Model
Amazon's Dynamo paper (2007) — and the open-source systems it inspired, chiefly Apache Cassandra and Riak — took a different approach: eliminate the leader concept entirely. Any replica can accept any write. Reads are sent to multiple replicas simultaneously, and the client (or a coordinator node) reconciles any differences.
The mathematics that make this work are the quorum rules. Given a cluster of N replicas:
- Every write is sent to
Wreplicas; it succeeds whenWconfirm it. - Every read is sent to
Rreplicas; the client takes the most recent value. - If
W + R > N, at least one node will be in both the write set and the read set, guaranteeing the read sees the latest write.
A typical production configuration is N=3, W=2, R=2. This tolerates one replica being down for both reads and writes (since you only need 2 of 3). To lean toward availability, use W=1, R=1 (fast but stale reads possible). To lean toward consistency, use W=3, R=1 (every write confirmed by all — slower, but reads are always fresh).
Read Repair and Anti-Entropy
When a leaderless system reads from R replicas and finds that some are behind, it repairs them on the fly: the coordinator writes the latest version back to stale replicas before returning the answer to the client. This is called read repair. A background anti-entropy process also continuously compares replicas using a Merkle tree (a hash tree of data segments) and syncs any divergence, even for data that nobody reads. Together they ensure replicas converge even after prolonged outages.
Version Vectors and Conflict Detection
How does the system know which of two conflicting values is "newer"? It cannot rely on wall-clock time alone. Instead, each value carries a version vector (a map from node ID to sequence number, similar to a vector clock). When two writes happen concurrently on different nodes and neither version vector is strictly greater than the other, they are siblings — concurrent conflicts that the application must resolve (usually via a merge function or CRDT). When one vector dominates the other, the greater one is definitively newer and replaces it.
Multi-Leader vs Leaderless: When to Use Which
Both patterns sacrifice simplicity for scale. The right choice depends on your primary constraint:
- Use multi-leader when you need low write latency from multiple geographically distinct regions and your data model allows you to write conflict resolution logic (or you use CRDTs).
- Use leaderless (Dynamo-style) when you need extreme write availability (the system keeps accepting writes even when multiple nodes are down) and eventual consistency is acceptable — think shopping carts, sensor data, analytics counters, or social media timelines.
- For both, be prepared to handle the complexity they introduce at the application layer. If your workload fits on a single leader with one or two read replicas, that is almost always the right choice.
LOCAL_QUORUM consistency level: the quorum is satisfied by replicas in the local data center only. This gives you intra-DC consistency with low latency while still replicating asynchronously to remote DCs for disaster recovery.