Data Consistency & Replication

Leader-Follower Replication

18 min Lesson 5 of 10

Leader-Follower Replication

Every database that must survive hardware failure needs at least two copies of its data. Leader-follower replication — also called primary-replica or master-slave replication — is the most widely deployed strategy for achieving this. One node, the leader, accepts all writes. One or more followers receive a continuous stream of changes from the leader and keep their own copies up to date. Reads can be served from any node in the cluster.

PostgreSQL, MySQL, MongoDB, Redis, Kafka, and dozens of other systems all ship leader-follower replication as a first-class feature. Understanding how it works — and where it breaks — is foundational for anyone designing systems at scale.

How Replication Works

When a write lands on the leader, the leader appends it to a durable, ordered log — called the Write-Ahead Log (WAL) in PostgreSQL or the binary log (binlog) in MySQL. Followers maintain a persistent connection to the leader and stream this log continuously. Each follower replays the same operations in the same order, which guarantees that their local copy converges to the same state as the leader.

Writes always go to the Leader; the Leader streams changes to Followers via WAL/binlog; Reads can be served from any node.

Synchronous vs Asynchronous Replication

The critical design choice in any leader-follower setup is when the leader considers a write committed. There are two fundamental modes:

Synchronous Replication

The leader waits for at least one follower to acknowledge that it has written the change to durable storage before returning success to the client. Only then does the leader acknowledge the write.

Guarantee: If the leader fails immediately after the write, the designated synchronous follower has a complete, up-to-date copy. Zero data loss on leader failure.
Cost: Every write now requires a network round-trip to the follower. If the follower is in a different data center 50 ms away, every write is at least 100 ms slower. If the synchronous follower crashes, writes block entirely until it recovers or is reconfigured.

Real-world example: PostgreSQL lets you mark individual replicas as synchronous_standby_names. AWS RDS Multi-AZ uses synchronous replication to the standby replica — writes are held until the standby confirms. That is why RDS Multi-AZ has slightly higher write latency than single-AZ deployments, but guarantees an RPO (Recovery Point Objective) of zero.

Asynchronous Replication

The leader writes to its own durable storage and immediately returns success to the client, without waiting for any follower to confirm. The replication log is shipped to followers in the background, typically within milliseconds.

Gain: Write latency is determined entirely by the leader's local disk. Cross-region followers do not slow down writes at all.
Risk: Replication lag. If the leader fails before a follower has replicated the last few writes, those writes are permanently lost. This is called replication lag data loss.

Real-world numbers: In a healthy same-datacenter deployment, PostgreSQL asynchronous replication lag is typically under 10 ms. Under write load or across regions it can grow to hundreds of milliseconds or even seconds. Any read from a lagging follower may return stale data — a classic eventual consistency scenario.

Synchronous replication waits for follower acknowledgment before confirming the write; asynchronous replication confirms immediately and ships changes in the background.

Semi-synchronous replication (used by MySQL and Google Spanner) is the pragmatic middle ground: the leader waits for at least one follower to acknowledge, but all other followers remain asynchronous. You get durability without requiring the entire fleet to be healthy for writes to proceed.

Replication Lag: The Silent Bug Factory

Asynchronous replication creates a window — often called the replication lag — during which a follower's data is slightly behind the leader. This produces surprising application bugs:

Read-your-own-writes violation: A user submits a profile update. The write goes to the leader. One second later they refresh; their request is routed to a lagging follower. The old profile appears. The user concludes the form did not work and submits again, creating a duplicate.
Monotonic reads violation: User A reads comment count from Follower 1 (lag 0 ms) and sees 42. They refresh; the request hits Follower 2 (lag 800 ms) and returns 38. Time appears to go backward.
Causality violations: User A posts a message; User B sees User A's reply but not the original message (the reply was on the leader, the question still in transit to the follower).

Mitigations for read-your-own-writes: Route reads for a specific user to the leader for at least one second after any write by that user, or track a replication position in a cookie and only serve reads from followers whose log is at or past that position. Both approaches are used in production at large scale.

Failover: Promoting a Follower to Leader

When a leader crashes, the system must elect a new leader from among the surviving followers. This process is called failover, and it is where replication mode has the most dramatic consequences.

Detecting failure: Most systems use a timeout-based approach. If the leader has not sent a heartbeat within N seconds (typically 10–30 s), followers conclude it has failed. This timeout must be tuned carefully — too short causes false positives during transient network hiccups; too long means extended downtime.

Choosing the new leader: The follower with the most up-to-date replication log is the safest choice to minimize data loss. In an automatically managed cluster (PostgreSQL Patroni, MySQL Group Replication, MongoDB replica sets), an election algorithm — often based on Raft, which you will study in Lesson 7 — handles this choice.

The data loss risk with async replication: If the old leader was 50 writes ahead of all followers at the moment it crashed, those 50 writes are gone. Worse, if the old leader comes back online (say, it was only partitioned, not dead), it now has writes that the new leader does not know about. This creates a split-brain scenario: two nodes both believe they are the authoritative leader. Any writes accepted by the resurrected old leader must either be discarded or merged — neither is trivial.

Never allow two leaders simultaneously. Split-brain is one of the most dangerous failure modes in distributed databases. Automatic failover systems use a technique called fencing (STONITH — Shoot The Other Node In The Head) to guarantee that the old leader is completely disabled before the new one begins accepting writes. Without fencing, split-brain will silently corrupt data.

Read Scaling and the Capacity Model

One of the key motivations for follower replicas is read scaling. If 95% of your traffic is reads, adding three read replicas triples your read capacity without changing the leader at all. This is exactly the model used by virtually every large web application:

Wikipedia runs dozens of MariaDB read replicas globally, routing reader traffic to the nearest one while all edits go to the primary.
Stack Overflow serves the majority of its SQL reads from replicas, reserving the primary for writes and cache invalidation queries.
Most e-commerce product catalog pages are read from replicas; checkout (a write path) always hits the primary.

Followers do not help write throughput. All writes still funnel through a single leader. If your bottleneck is write capacity (high-volume event logging, time-series ingestion, financial ledgers), leader-follower replication alone cannot help — you need sharding, multi-leader replication (Lesson 6), or a purpose-built write-optimized store.

Choosing Your Configuration

A practical decision matrix:

RPO = 0, writes can tolerate extra latency: Use synchronous replication to at least one follower. AWS RDS Multi-AZ, Google Cloud SQL HA, and PostgreSQL with synchronous_commit = on all deliver this.
Low-latency writes, can tolerate small data loss window: Asynchronous replication with automated failover (Patroni, Orchestrator). Accept that failover may lose the last few seconds of writes.
Cross-region reads with low latency: Async read replicas in each region. Accept that cross-region follower reads may be slightly stale.
Need both durability and speed: Semi-synchronous replication — one sync follower in the same DC, all others async, including cross-region replicas.