Eventual Consistency in Practice
Eventual consistency is one of the most widely deployed — and most widely misunderstood — consistency models in distributed systems. The promise is simple: if no new writes are made to a piece of data, all replicas will converge to the same value eventually. But that word "eventually" hides enormous practical complexity: how long is eventually? What happens when replicas disagree? How do you merge conflicting writes? This lesson answers all three questions with real-world numbers and concrete resolution strategies.
DynamoDB global tables and social media "like" counters are all eventually consistent. It is the default model for any system that prioritizes availability and partition tolerance over immediate consistency.
How Replication Lag Creates Staleness
Consider a leader-follower setup where the primary receives a write and asynchronously ships it to replicas. Under normal conditions, replication lag is 10–100 milliseconds. But during a network hiccup, a follower restart, or a traffic spike, lag can balloon to seconds or even minutes. Any read served by that follower during that window returns stale data.
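The stale-read window described above can be illustrated with a minimal in-memory sketch. The `Primary` and `Follower` classes and the key `profile:42` are hypothetical names for illustration only; real replication runs over the network and is driven by timing, not by explicit calls.

```python
class Primary:
    """Accepts writes and keeps an ordered log for followers to consume."""

    def __init__(self):
        self.log = []    # ordered (key, value) writes
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Follower:
    """Applies the primary's log asynchronously, so it may lag behind."""

    def __init__(self, primary):
        self.primary = primary
        self.applied = 0  # number of log entries applied so far
        self.data = {}

    def replicate(self):
        # Apply every log entry this follower has not seen yet.
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)

    def lag(self):
        # Lag measured in unapplied writes (a stand-in for time-based lag).
        return len(self.primary.log) - self.applied

    def read(self, key):
        return self.data.get(key)


primary = Primary()
follower = Follower(primary)

primary.write("profile:42", "v1")
follower.replicate()
primary.write("profile:42", "v2")     # not yet shipped to the follower

stale = follower.read("profile:42")   # still the old value
lag_before = follower.lag()           # one write behind
follower.replicate()
fresh = follower.read("profile:42")   # converged after replication
```

Any read that lands between the second write and the second `replicate()` call sees the old value, which is exactly the staleness window the lag figures describe.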
Real numbers from production systems: Facebook's TAO (the social graph store) tolerates tens of milliseconds of staleness for most edge reads. Amazon DynamoDB global tables typically replicate between regions within about a second under normal conditions, but that is a typical figure rather than a guarantee, and lag can grow during network partitions. The key insight is that staleness is a spectrum, not a binary on/off switch, and your system design must account for the acceptable range.
Conflict Resolution Strategies
When multiple nodes accept writes to the same key concurrently — common in multi-leader or leaderless topologies — you get write conflicts. Eventual consistency does not prevent conflicts; it requires you to resolve them deterministically. Here are the four dominant strategies:
1. Last Write Wins (LWW)
Each write is tagged with a timestamp (often a physical or logical clock). On conflict, the write with the highest timestamp wins. Cassandra uses LWW by default. It is simple and fast — no coordination needed — but it silently discards data. If two clients write different values at nearly the same wall-clock time, the "losing" write disappears with no error.
DynamoDB's conditional writes (ConditionExpression) and Cassandra's lightweight transactions (IF NOT EXISTS) exist precisely to escape this trap.
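The silent data loss in Last Write Wins is easy to see in a small sketch. The `Versioned` record, its `node_id` tie-breaker, and the example values are assumptions made for illustration; they are not the storage format of Cassandra or any other database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # assumed: milliseconds from the writer's own clock
    node_id: str    # assumed tie-breaker so equal timestamps resolve deterministically


def lww_merge(a, b):
    """Keep the write with the higher (timestamp, node_id); drop the other."""
    return a if (a.timestamp, a.node_id) >= (b.timestamp, b.node_id) else b


# Two clients update the same key one millisecond apart on different nodes.
write_a = Versioned("alice@old.example", 1_700_000_000_000, "node-a")
write_b = Versioned("alice@new.example", 1_700_000_000_001, "node-b")

winner = lww_merge(write_a, write_b)
# write_a is discarded: no error is raised and no trace of it remains.
```

Because the merge compares only timestamps, a writer whose clock runs slightly behind can lose every conflict, even when its write happened later in real time.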
2. Merge / CRDT (Conflict-free Replicated Data Types)
Design your data type so that concurrent writes can always be merged without ambiguity. A CRDT grow-only counter (G-Counter) never conflicts: each node tracks its own increments, the merge takes the per-node maximum, and the counter's value is the sum across nodes. A CRDT set can be add-only (G-Set) or can use a separate tombstone set for removals (2P-Set). Riak natively supports several CRDT types, as do Redis Enterprise Active-Active databases. CRDTs and related techniques underpin collaborative tools: Google Docs uses operational transformation, a related but distinct approach, and Figma uses a CRDT-inspired design for simultaneous edits.
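A grow-only counter with the per-node max merge described above might look like the following sketch. The `GCounter` class and its method names are illustrative and do not correspond to the API of Riak or any other product.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> total increments originating at that node

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        # The counter's value is the sum of every node's slot.
        return sum(self.counts.values())

    def merge(self, other):
        # Per-node max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a = GCounter("a")
b = GCounter("b")
a.increment()
a.increment()
b.increment(3)

a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing

total_a = a.value()
total_b = b.value()
```

Taking the max rather than adding the slots is what makes repeated merges safe: a replica that receives the same state twice does not double-count.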
3. Application-Level Conflict Resolution
Read the conflicting versions and merge them in application code. Dynamo-style stores such as Riak track versions with vector clocks and return all conflicting versions (siblings) on a read; the managed DynamoDB service does not expose siblings. Your application decides the correct merged state and writes it back. This is powerful but requires careful design for every conflict scenario. Shopping carts are the textbook example: if two devices add different items, the safe merge is the union of both carts. Amazon's Dynamo paper (2007) popularized this approach.
4. Version Vectors / Vector Clocks
Each replica tracks a version vector: a map of node_id → write_count. When replica A has version {A:3, B:2} and replica B has {A:3, B:3}, B is strictly ahead — no conflict. When A has {A:4, B:2} and B has {A:3, B:3}, they are concurrent — conflict detected. This is how Riak and Voldemort detect whether one value causally supersedes another before invoking merge logic.
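The comparison rule can be sketched as a small function over plain dictionaries. The function name `compare` and its string return values are illustrative choices, not part of Riak's or Voldemort's interfaces.

```python
def compare(vv_a, vv_b):
    """Classify two version vectors (dicts of node_id -> write_count)."""
    nodes = set(vv_a) | set(vv_b)
    # A missing entry means that replica has seen no writes from that node.
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # neither supersedes the other: a real conflict
    if a_ahead:
        return "a_dominates"
    if b_ahead:
        return "b_dominates"
    return "equal"


# The two cases from the text above.
no_conflict = compare({"A": 3, "B": 2}, {"A": 3, "B": 3})
conflict = compare({"A": 4, "B": 2}, {"A": 3, "B": 3})
```

Only the "concurrent" result needs merge logic; in the other cases the dominated value can simply be replaced.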
Eventual Consistency in Practice: Three Real-World Scenarios
Scenario 1 — Social Media Like Counts
Facebook, Instagram, and Twitter all show like/view counters that are eventually consistent. Counts are maintained in sharded counter stores (Facebook uses TAO; Twitter used Manhattan). Each shard aggregates its local count; a background process periodically merges totals. A user may see slightly different counts on different page refreshes. This is acceptable because approximate counts are sufficient and the cost of strong consistency (global coordination) would be prohibitive at billions of writes per day. The counter is a CRDT-style G-Counter merge: the correct merged value is the sum of per-shard increments.
Scenario 2 — Shopping Cart (Amazon Dynamo)
Amazon's Dynamo paper described how the shopping cart must never fail to accept an add even during network partitions. The consequence: two disconnected nodes each accept different items. On reconnect, both versions (siblings) exist. Dynamo returns all siblings to the application, which merges them by taking the union of items. This means a removed item might reappear after a partition — Amazon chose this trade-off deliberately, preferring a ghost item over a lost item. The lesson: choose your failure mode consciously.
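The ghost-item effect follows directly from a union merge, as this sketch shows. Carts are modelled as plain Python sets and `merge_carts` is a hypothetical helper; the actual Dynamo cart representation is not specified here.

```python
def merge_carts(siblings):
    """Resolve sibling carts by taking the union of their items."""
    merged = set()
    for cart in siblings:
        merged |= cart
    return merged


base = {"book", "lamp"}

# During a partition, each replica accepts a different change to the same cart.
replica_1 = base - {"lamp"}     # device 1 removes the lamp
replica_2 = base | {"kettle"}   # device 2 adds a kettle

merged = merge_carts([replica_1, replica_2])
# The union cannot tell "never removed" from "removed on the other replica",
# so the lamp comes back alongside the kettle.
```

Avoiding the ghost item would require tracking removals explicitly, for example with tombstones, which is the trade-off the 2P-Set design makes.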
Scenario 3 — DNS Propagation
DNS is the oldest and most globally deployed eventually consistent system. TTL values determine how long resolvers cache records, typically 300 seconds to 86400 seconds. After you update an A record, resolvers gradually evict their cached value and fetch the new one. During the convergence window (potentially hours for high-TTL records), some clients resolve to the old IP and some to the new one. The mitigation: lower your TTL to 300 seconds at least one full old-TTL period before any planned cutover (a day ahead for an 86400-second TTL), so that caches still holding the long-TTL record expire before you make the change.
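The timing of a TTL-lowering cutover is simple arithmetic, worked through below. The function names and the example date are hypothetical, and the sketch assumes resolvers honor TTLs, which some do not.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds):
    """A resolver that cached the record just before the TTL was lowered
    may keep it for the full old TTL, so wait at least that long."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_convergence(new_ttl_seconds):
    """After the record changes, a well-behaved cache can serve the old
    value for at most the new, shorter TTL."""
    return timedelta(seconds=new_ttl_seconds)


ttl_lowered_at = datetime(2024, 1, 1, 9, 0, 0)    # example date only
cutover = earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds=86400)
window = worst_case_convergence(new_ttl_seconds=300)
# cutover is 24 hours after the TTL change; window is five minutes.
```

Without the preparatory TTL change, the convergence window would be the full 86400 seconds instead of 300.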
Designing for Eventual Consistency
Accepting eventual consistency in your system requires deliberate design at every layer:
- Make operations idempotent. If a write may be replayed (due to retries or at-least-once delivery), the same write applied twice must produce the same state as applying it once. Use unique write IDs or conditional writes (PUT IF version == X).
- Read your own writes. After a user submits a form, route their next read to the primary (or use sticky sessions / version tokens) so they see the result of their own action immediately, even while replicas catch up.
- Use monotonic reads. Once a client has seen version V, never show them a version older than V. Sticky sessions or read-your-writes tokens on a per-client basis prevent backward time travel.
- Expose staleness to the UI. In dashboards and analytics, show "data as of 5 minutes ago" rather than pretending counts are live. Users are far more tolerant of acknowledged staleness than of unexplained inconsistencies.
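Two of the practices above, idempotent writes and monotonic reads, can be sketched together in a few lines. The `Replica` class, the write IDs, and the version-token scheme are assumptions for illustration; production systems usually bound the set of remembered write IDs and redirect a rejected read to a fresher replica.

```python
class StaleReadError(Exception):
    """Raised when a replica is behind what the client has already seen."""


class Replica:
    def __init__(self):
        self.balance = 0
        self.version = 0
        self.seen_write_ids = set()

    def apply(self, write_id, delta):
        """Apply a write at most once, even if it is delivered repeatedly."""
        if write_id in self.seen_write_ids:
            return False  # duplicate delivery: state is left unchanged
        self.seen_write_ids.add(write_id)
        self.balance += delta
        self.version += 1
        return True


def monotonic_read(replica, last_seen_version):
    """Refuse to serve a client from a replica older than its last read."""
    if replica.version < last_seen_version:
        raise StaleReadError("replica is behind the client's version token")
    return replica.balance, replica.version


fresh = Replica()
lagging = Replica()

fresh.apply("w1", 10)
fresh.apply("w1", 10)    # retry of the same write: ignored
fresh.apply("w2", 5)
lagging.apply("w1", 10)  # this replica has not received w2 yet

balance, token = monotonic_read(fresh, last_seen_version=0)

try:
    monotonic_read(lagging, last_seen_version=token)
    rejected = False
except StaleReadError:
    rejected = True      # the client is never shown an older state
```

The client carries `token` from one read to the next, which is the per-client version token the monotonic-reads bullet refers to.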
Eventual consistency is not a compromise — for many workloads it is the correct choice. When availability and low-latency writes matter more than instantaneous global agreement, accepting a bounded convergence window and a clear conflict policy gives you a system that can scale to any geography without coordination bottlenecks.