Data Consistency & Replication

Eventual Consistency in Practice

18 min Lesson 3 of 10

Eventual Consistency in Practice

Eventual consistency is one of the most widely deployed — and most widely misunderstood — consistency models in distributed systems. The promise is simple: if no new writes are made to a piece of data, all replicas will converge to the same value eventually. But that word "eventually" hides enormous practical complexity: how long is eventually? What happens when replicas disagree? How do you merge conflicting writes? This lesson answers all three questions with real-world numbers and concrete resolution strategies.

Where you already use eventual consistency: DNS propagation (changes take minutes to hours to reach all resolvers), Amazon S3 cross-region replication, Cassandra with quorum below ALL, DynamoDB global tables, and every social media "like" counter — all eventual. It is the default model for any system that prioritizes availability and partition tolerance over immediate consistency.

How Replication Lag Creates Staleness

Consider a leader-follower setup where the primary receives a write and asynchronously ships it to replicas. Under normal conditions, replication lag is 10–100 milliseconds. But during a network hiccup, a follower restart, or a traffic spike, lag can balloon to seconds or even minutes. Any read served by that follower during that window returns stale data.

Real numbers from production systems: Facebook's TAO (the social graph store) tolerates tens of milliseconds of staleness for most edge reads. Amazon DynamoDB global tables quote sub-second replication between regions under normal conditions, but the SLA explicitly allows for higher lag during network partitions. The key insight is that staleness is a spectrum, not a binary on/off switch — and your system design must account for the acceptable range.

Eventual Consistency — Replication Lag and Staleness Window Client writes & reads Primary x = 42 (latest) Replica A x = 42 ✓ synced Replica B x = 37 ✗ stale Replica C x = 37 ✗ stale Reader sees x = 37! write x=42 replicated pending... read Staleness window: lag > 0 10ms (normal) — minutes (partition)
A primary with three replicas: Replica A has caught up; Replicas B and C are lagging, causing a reader to see a stale value of x=37 instead of the latest x=42.

Conflict Resolution Strategies

When multiple nodes accept writes to the same key concurrently — common in multi-leader or leaderless topologies — you get write conflicts. Eventual consistency does not prevent conflicts; it requires you to resolve them deterministically. Here are the four dominant strategies:

1. Last Write Wins (LWW)

Each write is tagged with a timestamp (often a physical or logical clock). On conflict, the write with the highest timestamp wins. Cassandra uses LWW by default. It is simple and fast — no coordination needed — but it silently discards data. If two clients write different values at nearly the same wall-clock time, the "losing" write disappears with no error.

LWW danger: Clock skew between nodes can cause newer writes to lose to older ones. Never use LWW for financial data, inventory counts, or any field where silent data loss is unacceptable. Amazon DynamoDB's conditional writes (ConditionExpression) and Cassandra's lightweight transactions (IF NOT EXISTS) exist precisely to escape this trap.

2. Merge / CRDT (Conflict-free Replicated Data Types)

Design your data type so that concurrent writes can always be merged without ambiguity. A CRDT counter never conflicts — each node tracks its own increments; the merge is a per-node max. A CRDT set can only add elements (G-Set) or uses a separate tombstone set for removals (2P-Set). Redis and Riak natively support several CRDT types. CRDTs are the gold standard for collaborative tools: Google Docs uses an operational-transform variant; Figma uses a CRDT for simultaneous edits.

3. Application-Level Conflict Resolution

Read the conflicting versions and merge them in application code. DynamoDB returns all conflicting versions (siblings) when you use vector clocks. Your application decides the correct merged state and writes it back. This is powerful but requires careful design for every conflict scenario. Shopping carts are the textbook example: if two devices add different items, the safe merge is the union of both carts. Amazon's Dynamo paper (2007) popularized this approach.

4. Version Vectors / Vector Clocks

Each replica tracks a version vector: a map of node_id → write_count. When replica A has version {A:3, B:2} and replica B has {A:3, B:3}, B is strictly ahead — no conflict. When A has {A:4, B:2} and B has {A:3, B:3}, they are concurrent — conflict detected. This is how Riak and Voldemort detect whether one value causally supersedes another before invoking merge logic.

Conflict Resolution Strategies Comparison Strategy How it works Used by Risk Last Write Wins (LWW) Highest timestamp wins; discard loser Cassandra (default), DynamoDB Silent data loss on clock skew CRDT Conflict-free type Mathematically merge-safe by design Redis, Riak, Google Docs (OT) Limited to supported data types only App-Level Merge Custom logic App reads siblings, writes merged result Amazon Dynamo, Riak Complex; every field needs a merge rule Vector Clocks Causal detection Per-node counters detect concurrency Riak, Voldemort, CRDTs internally Overhead grows with number of nodes
Four conflict resolution strategies for eventually consistent systems — compared by mechanism, real-world usage, and key risk.

Eventual Consistency in Practice: Three Real-World Scenarios

Scenario 1 — Social Media Like Counts

Facebook, Instagram, and Twitter all show like/view counters that are eventually consistent. Counts are maintained in sharded counter stores (Facebook uses Scuba and Tao; Twitter used Manhattan). Each shard aggregates its local count; a background process periodically merges totals. A user may see slightly different counts on different page refreshes — this is acceptable because approximate counts are sufficient and the cost of strong consistency (global coordination) would be prohibitive at billions of writes per day. The counter is a CRDT-style G-Counter merge: the correct merged value is the sum of per-shard increments.

Scenario 2 — Shopping Cart (Amazon Dynamo)

Amazon's Dynamo paper described how the shopping cart must never fail to accept an add even during network partitions. The consequence: two disconnected nodes each accept different items. On reconnect, both versions (siblings) exist. Dynamo returns all siblings to the application, which merges them by taking the union of items. This means a removed item might reappear after a partition — Amazon chose this trade-off deliberately, preferring a ghost item over a lost item. The lesson: choose your failure mode consciously.

Scenario 3 — DNS Propagation

DNS is the oldest and most globally deployed eventually consistent system. TTL values determine how long resolvers cache records — typically 300 seconds to 86400 seconds. After you update an A record, all resolvers gradually evict their cached value and fetch the new one. During the convergence window (potentially hours for high-TTL records), some clients resolve to the old IP and some to the new one. The mitigation: lower your TTL to 300 seconds a day before any planned cutover, giving caches time to drain before you make the change.

Designing for Eventual Consistency

Accepting eventual consistency in your system requires deliberate design at every layer:

  • Make operations idempotent. If a write may be replayed (due to retries or at-least-once delivery), the same write applied twice must produce the same state as applying it once. Use unique write IDs or conditional writes (PUT IF version == X).
  • Read your own writes. After a user submits a form, route their next read to the primary (or use sticky sessions / version tokens) so they see the result of their own action immediately, even while replicas catch up.
  • Use monotonic reads. Once a client has seen version V, never show them a version older than V. Sticky sessions or read-your-writes tokens on a per-client basis prevent backward time travel.
  • Expose staleness to the UI. In dashboards and analytics, show "data as of 5 minutes ago" rather than pretending counts are live. Users are far more tolerant of acknowledged staleness than of unexplained inconsistencies.
Best practice — choose conflict resolution before you build: Every eventually consistent store requires a conflict resolution policy. Retrofitting one onto production data is painful and error-prone. Decide LWW vs. CRDT vs. application merge when you choose your database, document the rule per field, and enforce it in code reviews.

Eventual consistency is not a compromise — for many workloads it is the correct choice. When availability and low-latency writes matter more than instantaneous global agreement, accepting a bounded convergence window and a clear conflict policy gives you a system that can scale to any geography without coordination bottlenecks.