Project: Choose a Consistency Strategy
Project: Choose a Consistency Strategy
Every concept in this tutorial — CAP, consistency models, quorums, replication topologies, Raft, 2PC, Sagas — converges on a single engineering decision: what consistency guarantee does my system actually need, and how do I implement it without over-engineering? This capstone lesson works through that decision end to end, using a realistic e-commerce platform as the running case study. The goal is a repeatable framework you can apply to any system you design.
The Sample System: ShopFleet
ShopFleet is a mid-scale e-commerce platform: 2 million daily active users, 50,000 orders per day, deployed across three AWS regions (us-east-1, eu-west-1, ap-southeast-1). It has six core services:
- Product Catalog — SKU data, images, descriptions
- Inventory — real-time stock counts per SKU per warehouse
- Cart — session-scoped shopping baskets
- Orders — order lifecycle (created → paid → fulfilled → shipped)
- Payments — integration with Stripe / bank rails
- User Profiles — account settings, saved addresses, preferences
Each service has its own database. There is no shared datastore. This is a classic microservices topology — and the consistency problem lives at the seams between these services.
Step 1 — Classify Each Service by Read/Write Profile and Staleness Tolerance
Before picking a consistency model, you must understand two axes for each service: how stale can reads safely be? and what happens if a write is lost or duplicated?
Step 2 — Assign a Consistency Model to Each Service
With the matrix in hand, assign a model to each service and justify it with real consequences:
Payments — Linearizable + 2PC
A payment debit and credit must be atomic. If the Payments service charges the customer but the Order service never receives confirmation (or vice versa), the business has a fraud or revenue problem. The correct tool here is 2PC over XA between the Payments DB and the Orders DB — both live in the same region, latency is under 2 ms, and the transaction completes in under 50 ms. The blocking risk of 2PC is acceptable because we run a highly-available coordinator (Atomikos with a replicated coordinator log on PostgreSQL). Internally, the Payments DB uses a synchronous single-leader replication with a quorum write (w = majority) before returning to the coordinator.
Inventory — Read-Your-Writes + Quorum Reads
Inventory is trickier than it looks. The real risk is overselling: two concurrent checkout flows reading stock = 1 and both successfully placing an order, leaving stock at −1. The solution is a conditional write (optimistic locking via a version counter) on the Inventory DB, not distributed coordination:
If the UPDATE affects 0 rows, the checkout flow retries or surfaces an "out of stock" error. This is a single-node ACID transaction — no 2PC needed. For cross-region reads (showing stock on product pages), a quorum read (r = 2 of 3 replicas) is used, tolerating ~50 ms of stale display data. The actual deduction always happens on the primary.
Orders — Monotonic Reads + Saga Orchestration
An order flows through states: CREATED → PAYMENT_PENDING → PAID → FULFILLMENT_PENDING → SHIPPED → DELIVERED. Each transition touches a different service (Payments, Warehouse, Shipping). This is the ideal use case for an orchestrated Saga: the Order service is the orchestrator; it issues commands, waits for replies, and runs compensating transactions on failure.
For reads, a customer viewing their order history must never see an order "go backwards" (e.g., appear as SHIPPED then revert to PAID). This requires monotonic read consistency — achieved by routing all reads from a given user session to the same replica (session-affinity sticky routing). Sticky routing is implemented at the API gateway with a consistent-hash on the user ID to a replica pool.
Cart — Session Consistency + Eventual
A cart is inherently session-scoped. Nobody else reads your cart; you only care that your writes are visible to your subsequent reads. This is textbook read-your-writes (also called session consistency). Implementation: the Cart service writes to a Redis primary; reads are served from the same primary (or from a replica that has already acknowledged the write, using a replication offset check). Cart data is ephemeral by design — a 24-hour TTL means a lost write is a minor UX annoyance, not a business catastrophe. Eventual consistency with read-your-writes is the right call here.
Product Catalog — Eventual Consistency + CDN Caching
Product descriptions, images, and pricing change infrequently (a few thousand writes per day across 2 million SKUs). Reads vastly outnumber writes. A 30-second stale price on a product page is acceptable — the checkout step performs a fresh authoritative read before confirming the price. The Catalog service uses multi-leader replication across all three AWS regions (each region has a leader) with last-write-wins conflict resolution on the updated_at timestamp. Static assets are pushed to CloudFront (CDN) with a 5-minute TTL. This topology maximises read throughput globally while tolerating brief cross-region divergence.
User Profiles — Eventual Consistency
Profile changes (shipping addresses, display name, email preferences) are low-frequency and the user's own writes are immediately visible via read-your-writes on the primary. Cross-region replication is asynchronous. If a user updates their address in Singapore and immediately makes a purchase in the US region, the old address might be used on that one order — an acceptable edge case. A compensating mechanism (address confirmation email) handles the rare divergence. Eventual consistency with asynchronous replication is correct here.
Step 3 — The Decision Framework (Generalised)
Use this five-question checklist on any service in any system:
- What is the cost of a stale read? If the answer is "money lost" or "fraud", choose strong consistency. If the answer is "minor UX glitch", choose eventual.
- What is the cost of a lost or duplicated write? If it is irreversible (money moved, stock deducted), use atomic writes with optimistic locking or 2PC. If it is recoverable, idempotency + at-least-once delivery is enough.
- Are participants co-located and few in number? 2PC is viable for 2–3 databases in the same data centre. For anything spanning regions or more than ~5 participants, use Sagas.
- Is the operation long-running (over 100 ms)? Never use 2PC. Use a Saga with compensating transactions. Lock hold time is proportional to saga step latency — that is the key insight.
- What replication lag can the SLA absorb? Translate your business SLA ("users see updated stock within 1 second") into a replication budget. Design read paths (quorum, sync, async) to hit that budget.
Step 4 — Validating Your Design
A consistency strategy is only as good as its failure-mode analysis. For each service, ask: what happens if the primary DB fails right now? and what happens if a cross-region network partition lasts 30 seconds? For ShopFleet:
- Payments primary fails: The synchronous replica is promoted (RDS Multi-AZ, ~30 s). In-flight 2PC transactions that did not complete abort cleanly on coordinator timeout. No data loss.
- Inventory primary fails: CAS writes fail; checkout surfaces "temporarily unavailable". Display reads fall back to the quorum-read replicas. Recovery in ~30 s.
- 30-second partition, us-east ↔ eu-west: Product Catalog diverges — EU users may see slightly different prices. The checkout authoritative read fires against the user's region primary, so no incorrect order is placed. Cart writes in EU queue locally and flush on partition heal.
Running these failure scenarios mentally — or better, in a chaos-engineering tool like Gremlin or AWS Fault Injection Simulator — turns a paper design into a validated architecture.
Summary
The right consistency strategy is not a single setting; it is a per-component decision driven by business impact, participant topology, and latency budgets. ShopFleet uses four distinct models across six services: linearizable + 2PC for money, optimistic-lock quorum for stock, monotonic reads + Sagas for order workflow, and eventual consistency for everything user-browsable. This is the normal, correct state of a well-designed distributed system — not a failure to achieve uniformity.