Data Consistency & Replication

Consensus: Raft & Paxos (Introduction)

18 min Lesson 7 of 10

Consensus: Raft & Paxos (Introduction)

Distributed systems must often make a single collective decision — which node is the current leader?, what is the next log entry?, should this transaction commit? — even when some participants are unreachable or slow. The consensus problem is the challenge of getting a set of nodes to agree on one value in the presence of failures.

Consensus is the engine underneath almost every strong-consistency guarantee you have learned so far: leader election in Kafka, the commit log in etcd, config management in Zookeeper, and the single-leader in CockroachDB all rely on a consensus algorithm.

Why Consensus Is Hard

Agreeing in a room of people is trivial. Agreeing across a network is not, because:

  • Messages can be delayed or reordered — a node cannot tell whether a peer is slow or dead.
  • Nodes can crash mid-way — a leader that fails after sending a proposal to only some followers leaves the cluster in an ambiguous state.
  • No shared clock — you cannot use wall-clock time to break ties reliably.

The FLP Impossibility result (Fischer, Lynch, Paterson 1985) proves that no deterministic algorithm can guarantee consensus in an asynchronous network if even one node can fail. Real systems work around this by using randomisation or partial synchrony assumptions — they accept that progress may momentarily stall, but they ensure that whenever the network is stable they converge.

Key idea: A consensus algorithm guarantees safety (nodes never disagree) at all times, and liveness (the cluster eventually makes progress) whenever the network is stable. It trades availability under extreme partitions for correctness — it chooses CP on the CAP triangle.

Paxos — The Classic Algorithm

Paxos (Leslie Lamport, 1989/1998) was the first practical consensus algorithm. It works in two phases:

  1. Prepare phase — A Proposer sends a Prepare(n) message with a ballot number n to a majority of Acceptors. Each acceptor promises not to accept any proposal numbered less than n and returns the highest-numbered proposal it has already accepted (if any).
  2. Accept phase — If the proposer collects promises from a majority, it sends Accept(n, v) where v is either its own value or the highest previously-accepted value it received. Acceptors accept the proposal unless they have already promised a higher ballot. Once a majority accepts, the value is chosen.

Paxos is provably correct but notoriously difficult to implement correctly in practice. Variants such as Multi-Paxos (reusing the leader across rounds), Fast Paxos, and Flexible Paxos exist to address performance, but each adds significant implementation complexity. Google's Chubby lock service and Apache Zookeeper (ZAB protocol, a Paxos variant) are real-world deployments.

Pitfall: Paxos as described in the original paper is a single-decree algorithm (one value). Turning it into a replicated log (Multi-Paxos) requires many engineering decisions the paper leaves unspecified — gap handling, leader lease, membership changes — which is a large source of bugs in production implementations.

Raft — Consensus Made Understandable

Raft (Ongaro & Ousterhout, 2014) was explicitly designed to be easier to understand than Paxos while providing equivalent guarantees. It decomposes the problem into three independent sub-problems:

  • Leader election — exactly one server is the leader at any given term.
  • Log replication — the leader accepts client requests, appends them to its log, and replicates the log to followers.
  • Safety — if a log entry is committed on one server it will appear on all future leaders.

Terms and Elections

Raft divides time into numbered terms. Each term begins with an election. If a follower does not hear from a leader within a random election timeout (typically 150–300 ms), it becomes a candidate, increments the term number, votes for itself, and requests votes from peers. A candidate that collects a majority of votes in its term becomes the new leader. Because the timeout is randomised, only one candidate usually starts at a time, avoiding split-vote loops.

Log Replication

Once elected, the leader serialises all writes. A client write is appended to the leader's log and then sent to followers as an AppendEntries RPC. When a majority acknowledges the entry the leader marks it committed, applies it to its state machine, and replies to the client. Followers apply committed entries in order, keeping their state machines identical to the leader's.

Raft leader election and log replication flow Node A Node B Node C Timeout Candidate T=1 RequestVote(T=1) RequestVote(T=1) VoteGranted VoteGranted LEADER Term = 1 AppendEntries AppendEntries ACK ACK Committed Reply to client Follower B Log appended Follower C Log appended
Raft sequence: Node A times out, wins the election for Term 1, then replicates a log entry to a majority before committing.

Raft vs. Paxos — Practical Comparison

Both algorithms tolerate up to ⌊(N−1)/2⌋ failures in a cluster of N nodes — a 3-node cluster survives 1 failure, a 5-node cluster survives 2. The key differences are in implementation complexity and operational clarity:

Raft vs Paxos comparison Attribute Raft Paxos (Multi-Paxos) Understandability Explicit, well-specified Notoriously complex Leader election Built-in, randomised timeout Not specified — left to impl. Membership changes Joint consensus / single-step No standard approach Real-world usage etcd, Consul, TiKV, CockroachDB Chubby (Google), Zookeeper (ZAB) Fault tolerance (N nodes) ⌊(N-1)/2⌋ failures ⌊(N-1)/2⌋ failures
Raft and Paxos side by side — same fault-tolerance guarantees, very different implementation complexity.

Where You Encounter Consensus in Production

You rarely implement a consensus algorithm yourself — instead you use a coordination service built on top of one:

  • etcd (Kubernetes' state store) — Raft. Stores cluster config; a quorum of 3 or 5 etcd nodes is standard.
  • Apache Zookeeper — ZAB (Zookeeper Atomic Broadcast), a Paxos variant. Used by Kafka for broker metadata (being replaced by KRaft, a Raft-based system).
  • Consul — Raft. Service discovery and distributed locks.
  • CockroachDB / TiDB — Raft per range/region. Geo-distributed SQL with strong consistency.
  • Google Spanner — Paxos per shard, combined with TrueTime for external consistency.

As an engineer designing systems, the practical takeaway is: when you need a single authoritative leader or a strongly-consistent distributed log, reach for etcd or Zookeeper rather than rolling your own. Consensus bugs are notoriously subtle and take years of production battle-testing to stabilise.

Sizing tip: Always run consensus clusters with an odd number of nodes (3 or 5 in practice). A 4-node cluster does not gain fault tolerance over a 3-node cluster — both require 3 nodes to form a quorum — but the 4-node cluster costs more and can suffer a split-brain scenario (2 vs 2) that an odd cluster cannot.

Performance Characteristics

Consensus adds latency because every write must complete at least one round trip to a majority of nodes before it can be acknowledged. In a 3-node cluster in the same data centre this is typically 1–5 ms. Across data centres or availability zones (say 50 ms cross-region RTT) a committed write takes at least 50–100 ms. This is why Raft/Paxos clusters are almost always kept within a single region, and geo-replication uses asynchronous or semi-synchronous strategies layered on top.

Write throughput is bounded by the leader's disk and CPU rather than the number of nodes — adding more followers does not increase write throughput. Reads can be served from followers in relaxed configurations, but reading from the leader (or using a leader lease) is required for linearisable reads.

Mental model: Think of Raft as a highly reliable, append-only notepad where every write requires a majority of the team to sign off before it becomes permanent. The leader holds the pen, and the team elects a new leader automatically if the current one goes silent.