Consensus: Raft & Paxos (Introduction)
Consensus: Raft & Paxos (Introduction)
Distributed systems must often make a single collective decision — which node is the current leader?, what is the next log entry?, should this transaction commit? — even when some participants are unreachable or slow. The consensus problem is the challenge of getting a set of nodes to agree on one value in the presence of failures.
Consensus is the engine underneath almost every strong-consistency guarantee you have learned so far: leader election in Kafka, the commit log in etcd, config management in Zookeeper, and the single-leader in CockroachDB all rely on a consensus algorithm.
Why Consensus Is Hard
Agreeing in a room of people is trivial. Agreeing across a network is not, because:
- Messages can be delayed or reordered — a node cannot tell whether a peer is slow or dead.
- Nodes can crash mid-way — a leader that fails after sending a proposal to only some followers leaves the cluster in an ambiguous state.
- No shared clock — you cannot use wall-clock time to break ties reliably.
The FLP Impossibility result (Fischer, Lynch, Paterson 1985) proves that no deterministic algorithm can guarantee consensus in an asynchronous network if even one node can fail. Real systems work around this by using randomisation or partial synchrony assumptions — they accept that progress may momentarily stall, but they ensure that whenever the network is stable they converge.
Paxos — The Classic Algorithm
Paxos (Leslie Lamport, 1989/1998) was the first practical consensus algorithm. It works in two phases:
- Prepare phase — A Proposer sends a
Prepare(n)message with a ballot numbernto a majority of Acceptors. Each acceptor promises not to accept any proposal numbered less thannand returns the highest-numbered proposal it has already accepted (if any). - Accept phase — If the proposer collects promises from a majority, it sends
Accept(n, v)wherevis either its own value or the highest previously-accepted value it received. Acceptors accept the proposal unless they have already promised a higher ballot. Once a majority accepts, the value is chosen.
Paxos is provably correct but notoriously difficult to implement correctly in practice. Variants such as Multi-Paxos (reusing the leader across rounds), Fast Paxos, and Flexible Paxos exist to address performance, but each adds significant implementation complexity. Google's Chubby lock service and Apache Zookeeper (ZAB protocol, a Paxos variant) are real-world deployments.
Raft — Consensus Made Understandable
Raft (Ongaro & Ousterhout, 2014) was explicitly designed to be easier to understand than Paxos while providing equivalent guarantees. It decomposes the problem into three independent sub-problems:
- Leader election — exactly one server is the leader at any given term.
- Log replication — the leader accepts client requests, appends them to its log, and replicates the log to followers.
- Safety — if a log entry is committed on one server it will appear on all future leaders.
Terms and Elections
Raft divides time into numbered terms. Each term begins with an election. If a follower does not hear from a leader within a random election timeout (typically 150–300 ms), it becomes a candidate, increments the term number, votes for itself, and requests votes from peers. A candidate that collects a majority of votes in its term becomes the new leader. Because the timeout is randomised, only one candidate usually starts at a time, avoiding split-vote loops.
Log Replication
Once elected, the leader serialises all writes. A client write is appended to the leader's log and then sent to followers as an AppendEntries RPC. When a majority acknowledges the entry the leader marks it committed, applies it to its state machine, and replies to the client. Followers apply committed entries in order, keeping their state machines identical to the leader's.
Raft vs. Paxos — Practical Comparison
Both algorithms tolerate up to ⌊(N−1)/2⌋ failures in a cluster of N nodes — a 3-node cluster survives 1 failure, a 5-node cluster survives 2. The key differences are in implementation complexity and operational clarity:
Where You Encounter Consensus in Production
You rarely implement a consensus algorithm yourself — instead you use a coordination service built on top of one:
- etcd (Kubernetes' state store) — Raft. Stores cluster config; a quorum of 3 or 5 etcd nodes is standard.
- Apache Zookeeper — ZAB (Zookeeper Atomic Broadcast), a Paxos variant. Used by Kafka for broker metadata (being replaced by KRaft, a Raft-based system).
- Consul — Raft. Service discovery and distributed locks.
- CockroachDB / TiDB — Raft per range/region. Geo-distributed SQL with strong consistency.
- Google Spanner — Paxos per shard, combined with TrueTime for external consistency.
As an engineer designing systems, the practical takeaway is: when you need a single authoritative leader or a strongly-consistent distributed log, reach for etcd or Zookeeper rather than rolling your own. Consensus bugs are notoriously subtle and take years of production battle-testing to stabilise.
Performance Characteristics
Consensus adds latency because every write must complete at least one round trip to a majority of nodes before it can be acknowledged. In a 3-node cluster in the same data centre this is typically 1–5 ms. Across data centres or availability zones (say 50 ms cross-region RTT) a committed write takes at least 50–100 ms. This is why Raft/Paxos clusters are almost always kept within a single region, and geo-replication uses asynchronous or semi-synchronous strategies layered on top.
Write throughput is bounded by the leader's disk and CPU rather than the number of nodes — adding more followers does not increase write throughput. Reads can be served from followers in relaxed configurations, but reading from the leader (or using a leader lease) is required for linearisable reads.