Data Consistency & Replication

Multi-Leader & Leaderless Replication

18 min Lesson 6 of 10

Multi-Leader & Leaderless Replication

Single-leader replication is simple to reason about, but it creates a bottleneck: every write must travel through one node. If that leader is in a data center on the other side of the world from your users, or if it simply cannot keep up with write volume, you hit a ceiling. Two architectural patterns break that ceiling in very different ways: multi-leader replication and leaderless replication (the Dynamo style). Both trade away some simplicity in exchange for massive gains in write throughput, geographic distribution, and fault tolerance.

Multi-Leader Replication

In a multi-leader setup, more than one node is allowed to accept writes. Each leader asynchronously replicates its writes to every other leader (and optionally to followers). The most common deployment is one leader per data center: writes from users in Europe go to the EU leader, writes from users in the US go to the US leader, and both leaders synchronize with each other in the background.

Real systems that use this pattern include CockroachDB (multi-region), MySQL Group Replication, PostgreSQL BDR (Bi-Directional Replication), and Google Docs (operational-transform style multi-master for collaborative editing).

Advantages:

Local write latency — each region writes to its nearby leader without a cross-continental round trip.
Continues accepting writes even if one data center goes offline.
Write throughput scales horizontally: two leaders roughly double write capacity.

The price: write conflicts. If two leaders accept different writes to the same record at the same time, you have a conflict that must be resolved. This is the central challenge of multi-leader systems.

Conflict Resolution Strategies

There is no single right answer; you choose based on semantics:

Last-Write-Wins (LWW): Attach a timestamp to every write; the higher timestamp wins. Simple but dangerous — clocks on distributed machines drift, so you can silently discard the "real" update. Used by Cassandra by default.
Custom merge logic: The application defines what "merge" means. A shopping cart might union both sets of items. A counter might add the deltas. CRDTs (Conflict-free Replicated Data Types) formalize this into data structures where any merge always produces the same result regardless of order.
On-write vs. on-read: Detect the conflict at write time and reject or queue it, or store all conflicting versions and resolve lazily when the data is next read (Amazon's shopping-cart model).

Multi-leader is hard to get right. Even databases that advertise multi-master support often have subtle bugs in their conflict resolution. If you can avoid multi-leader (e.g., by sharding writes so each shard has exactly one leader), do so. Only reach for it when you genuinely need sub-10 ms write latency from multiple continents or when write throughput cannot fit on one machine.

Two-datacenter multi-leader setup: each leader accepts writes locally and asynchronously replicates to the other. Simultaneous writes to the same record produce a conflict that must be resolved.

Leaderless Replication — The Dynamo Model

Amazon's Dynamo paper (2007) — and the open-source systems it inspired, chiefly Apache Cassandra and Riak — took a different approach: eliminate the leader concept entirely. Any replica can accept any write. Reads are sent to multiple replicas simultaneously, and the client (or a coordinator node) reconciles any differences.

The mathematics that make this work are the quorum rules. Given a cluster of N replicas:

Every write is sent to W replicas; it succeeds when W confirm it.
Every read is sent to R replicas; the client takes the most recent value.
If W + R > N, at least one node will be in both the write set and the read set, guaranteeing the read sees the latest write.

A typical production configuration is N=3, W=2, R=2. This tolerates one replica being down for both reads and writes (since you only need 2 of 3). To lean toward availability, use W=1, R=1 (fast but stale reads possible). To lean toward consistency, use W=3, R=1 (every write confirmed by all — slower, but reads are always fresh).

Quorum does not guarantee strong consistency in all cases. Network partitions can create "sloppy quorums" (Cassandra writes to available nodes even outside the preferred set via hinted handoff), and concurrent writes can still create version conflicts. Leaderless systems typically offer eventual consistency with tunable guarantees, not strict serializability.

Read Repair and Anti-Entropy

When a leaderless system reads from R replicas and finds that some are behind, it repairs them on the fly: the coordinator writes the latest version back to stale replicas before returning the answer to the client. This is called read repair. A background anti-entropy process also continuously compares replicas using a Merkle tree (a hash tree of data segments) and syncs any divergence, even for data that nobody reads. Together they ensure replicas converge even after prolonged outages.

Version Vectors and Conflict Detection

How does the system know which of two conflicting values is "newer"? It cannot rely on wall-clock time alone. Instead, each value carries a version vector (a map from node ID to sequence number, similar to a vector clock). When two writes happen concurrently on different nodes and neither version vector is strictly greater than the other, they are siblings — concurrent conflicts that the application must resolve (usually via a merge function or CRDT). When one vector dominates the other, the greater one is definitively newer and replaces it.

Leaderless quorum write with N=3, W=2: the coordinator sends the write to all three replicas; two confirmations satisfy the quorum and the client gets a success. Replica C is repaired asynchronously.

Multi-Leader vs Leaderless: When to Use Which

Both patterns sacrifice simplicity for scale. The right choice depends on your primary constraint:

Use multi-leader when you need low write latency from multiple geographically distinct regions and your data model allows you to write conflict resolution logic (or you use CRDTs).
Use leaderless (Dynamo-style) when you need extreme write availability (the system keeps accepting writes even when multiple nodes are down) and eventual consistency is acceptable — think shopping carts, sensor data, analytics counters, or social media timelines.
For both, be prepared to handle the complexity they introduce at the application layer. If your workload fits on a single leader with one or two read replicas, that is almost always the right choice.

Cassandra tuning tip: In Cassandra clusters that span multiple data centers, a common production pattern is LOCAL_QUORUM consistency level: the quorum is satisfied by replicas in the local data center only. This gives you intra-DC consistency with low latency while still replicating asynchronously to remote DCs for disaster recovery.