Data Consistency & Replication

The Saga Pattern

In the previous lesson you learned that Two-Phase Commit (2PC) provides strong consistency across distributed services — but at a steep cost: all participants must be locked and reachable for the duration of the transaction. In practice, locking an order record, an inventory record, and a payment record across three independent microservices for even 200 ms creates serious contention at scale and makes the system brittle whenever any single service is slow or temporarily unavailable.

The Saga pattern is the practical alternative. Instead of a single atomic distributed transaction, a saga breaks the business operation into a sequence of local transactions, each of which updates one service and publishes an event or sends a command to trigger the next step. If any step fails, the saga executes a series of compensating transactions — actions that semantically undo the work already committed — rolling the system back to a consistent state without ever holding distributed locks.

Key idea: A saga trades atomicity (all-or-nothing in a single instant) for eventual consistency (all services converge to the correct state after a finite number of steps, including compensation). It also gives up isolation, because the intermediate states of a running saga are visible to other transactions. This is an explicit, deliberate trade-off, and it is usually the right one when a business operation spans microservices that each own their database.

A Concrete Example: E-commerce Order Placement

Consider placing an order that spans four independent services: Order Service, Inventory Service, Payment Service, and Notification Service. The happy path requires all four steps to succeed; a failure in any step must undo the preceding committed work.

The forward steps are:

Order Service — create an order record with status PENDING.
Inventory Service — reserve the requested stock (subtract from available quantity).
Payment Service — charge the customer's payment method.
Notification Service — send a confirmation email; mark order CONFIRMED.

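The corresponding compensating transactions are listed below in the same order as the forward steps they undo. On failure they run in reverse order, and only for steps that have already committed:

Order Service marks the order as CANCELLED.
Inventory Service releases the reserved inventory (adds the quantity back).
Payment Service issues a refund to the payment method.
Notification Service sends a cancellation email.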

Figure: Left, the happy path, in which each step publishes a success event to trigger the next. Right, the failure path, in which Payment fails and compensating transactions run in reverse to release inventory and cancel the order.
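The order flow above can be sketched as a small in-memory saga runner. This is a minimal illustration, not a production design: the `Step` class, `run_saga`, and `build_order_saga` are hypothetical names, the "services" are plain functions that append to a log, and a raised exception stands in for a failed local transaction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # forward local transaction
    compensation: Callable[[], None]  # semantic undo of the action


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate committed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never committed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


def build_order_saga(log: List[str], payment_fails: bool) -> List[Step]:
    """Assumed stand-ins for the four services; each one just writes to `log`."""

    def charge() -> None:
        if payment_fails:
            raise RuntimeError("card declined")
        log.append("charge payment")

    return [
        Step("order",
             lambda: log.append("create order PENDING"),
             lambda: log.append("mark order CANCELLED")),
        Step("inventory",
             lambda: log.append("reserve stock"),
             lambda: log.append("release stock")),
        Step("payment", charge, lambda: log.append("refund payment")),
        Step("notification",
             lambda: log.append("send confirmation"),
             lambda: log.append("send cancellation")),
    ]
```

With `payment_fails=True`, the log ends with "release stock" followed by "mark order CANCELLED": the two committed steps are undone in reverse, and no refund is issued because the charge never committed.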

Choreography vs Orchestration

There are two fundamentally different ways to coordinate the steps of a saga:

Choreography (event-driven): each service listens for events on a message bus and reacts independently. The Order Service publishes OrderCreated; the Inventory Service consumes it and publishes StockReserved; the Payment Service consumes that and so on. No central coordinator exists — the saga emerges from the interactions.

Orchestration (command-driven): a dedicated Saga Orchestrator (often implemented as a state machine) explicitly commands each participant in sequence, waits for their reply, and decides what to do next — including issuing compensations if a step fails. The orchestrator is the single source of truth about the saga's current state.

Figure: Choreography (services emit and react to events on a shared bus, with no central coordinator) versus orchestration (a dedicated state machine issues commands to each participant and drives compensation on failure).
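To make the choreographed style concrete, here is a rough sketch using a synchronous in-memory event bus as a stand-in for a real message broker. `OrderCreated` and `StockReserved` come from the description above; every other event name, the `EventBus` class, and `wire_services` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # event types in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        self.history.append(event_type)
        for handler in list(self._handlers[event_type]):
            handler(payload)


def wire_services(bus: EventBus, stock: Dict[str, int], card_ok: bool) -> None:
    """Each handler plays one service; none of them calls another directly."""

    def on_order_created(evt: Dict) -> None:      # Inventory Service
        if stock.get(evt["sku"], 0) >= evt["qty"]:
            stock[evt["sku"]] -= evt["qty"]
            bus.publish("StockReserved", evt)
        else:
            bus.publish("StockUnavailable", evt)

    def on_stock_reserved(evt: Dict) -> None:     # Payment Service
        bus.publish("PaymentCharged" if card_ok else "PaymentFailed", evt)

    def on_payment_failed(evt: Dict) -> None:     # Inventory Service (compensation)
        stock[evt["sku"]] += evt["qty"]
        bus.publish("StockReleased", evt)

    def cancel_order(evt: Dict) -> None:          # Order Service (compensation)
        bus.publish("OrderCancelled", evt)

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("StockReserved", on_stock_reserved)
    bus.subscribe("PaymentFailed", on_payment_failed)
    bus.subscribe("StockReleased", cancel_order)
    bus.subscribe("StockUnavailable", cancel_order)
```

Notice that the overall flow appears nowhere in the code as a single sequence. It can only be reconstructed by reading the subscriptions, which is the understandability cost that choreography carries.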

Choreography: Trade-offs

Pros: Loose coupling — services do not know about each other; easy to add new steps by subscribing to existing events; no single point of failure for coordination.
Cons: Hard to understand the overall flow — it is implicit and spread across many services; difficult to debug when something goes wrong mid-saga; compensations must also be event-driven, making the interaction web complex.

Choreography works well for simple, linear sagas with 2–4 steps where the event contract is already well-defined (e.g., a webhook fan-out).

Orchestration: Trade-offs

Pros: The saga flow is explicit and centralised — easy to read, monitor, and debug; rollback logic is co-located with the forward logic; the orchestrator's state machine is a natural audit log.
Cons: Introduces a new component (the orchestrator) that must be highly available; risks becoming a "god service" if poorly scoped; can create tighter coupling through direct command channels.

Orchestration is generally the preferred choice for complex, multi-step, long-running flows in production microservice systems. Tools such as Apache Camel (through its Saga EIP), Netflix Conductor, Temporal, and AWS Step Functions support this style directly.

Best practice: Make saga state durable. Persist the orchestrator's state to a database before issuing each command. If the orchestrator crashes and restarts, it can resume from the last recorded step instead of losing the saga or triggering compensation by mistake. Because a command that was in flight at the time of the crash may be sent again on restart, participants must handle duplicate commands idempotently. Temporal, for example, records every workflow event in an append-only event history that survives process restarts.
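One way to picture durable saga state is the sketch below, which uses Python's built-in `sqlite3` module. The `saga_state` table, the helper names, and the idea of storing a single "next step" index are all simplifying assumptions. A real orchestrator would also persist payloads, replies, and compensation progress.

```python
import sqlite3
from typing import Callable, List, Tuple

Command = Tuple[str, Callable[[], None]]  # (name, function that sends the command)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga_state ("
        "saga_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def load_next_step(conn: sqlite3.Connection, saga_id: str) -> int:
    row = conn.execute(
        "SELECT next_step FROM saga_state WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    return row[0] if row else 0


def save_next_step(conn: sqlite3.Connection, saga_id: str, next_step: int) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO saga_state (saga_id, next_step) VALUES (?, ?)",
        (saga_id, next_step),
    )
    conn.commit()


def resume_saga(conn: sqlite3.Connection, saga_id: str, commands: List[Command]) -> None:
    """Start or resume a saga from its last persisted position."""
    step = load_next_step(conn, saga_id)
    # Record the saga before any command goes out, so a restart can find it.
    save_next_step(conn, saga_id, step)
    while step < len(commands):
        _name, send = commands[step]
        # After a crash this may be a re-send, so participants must be idempotent.
        send()
        step += 1
        save_next_step(conn, saga_id, step)
```

If the process dies while a command is being sent, the stored index still points at that command, so calling `resume_saga` again with the same `saga_id` picks up there rather than starting over.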

Designing Compensating Transactions

Compensating transactions are not rollbacks in the database sense — the original transaction already committed. A compensation is a new, forward-moving transaction that semantically reverses the business effect.

Three properties every compensation must satisfy:

Idempotent: the compensation can be executed multiple times safely. If the messaging layer redelivers the compensation command, the second execution must be a no-op (e.g., "if stock is already released, do nothing").
Commutative where possible: compensations should not depend on a specific ordering of concurrent events, because message delivery order is not guaranteed in distributed systems.
Complete: every forward step that mutates state and can be followed by a failure must have a well-defined compensation. A step that cannot be compensated has to be treated as a pivot transaction (see the pitfall below); if that does not fit the workflow, the operation needs a different mechanism such as 2PC.
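The idempotency requirement can be illustrated with a small inventory model. This is a sketch under assumptions: the `Inventory` class and its methods are hypothetical, and reservations are keyed by order ID so that a duplicate command can be recognised.

```python
from typing import Dict, Tuple


class Inventory:
    def __init__(self, available: Dict[str, int]) -> None:
        self.available = dict(available)
        # order_id -> (sku, qty) for reservations that are still held
        self._reservations: Dict[str, Tuple[str, int]] = {}

    def reserve(self, order_id: str, sku: str, qty: int) -> bool:
        if order_id in self._reservations:
            return True  # duplicate command: stock was already reserved
        if self.available.get(sku, 0) < qty:
            return False
        self.available[sku] -= qty
        self._reservations[order_id] = (sku, qty)
        return True

    def release(self, order_id: str) -> None:
        """Compensation: safe to call any number of times."""
        reservation = self._reservations.pop(order_id, None)
        if reservation is None:
            return  # already released (or never reserved): do nothing
        sku, qty = reservation
        self.available[sku] += qty
```

Calling `release("o1")` twice returns the stock exactly once. Without the reservation lookup, a redelivered compensation would add the quantity back a second time and inflate the available count.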

Pitfall ("pivot transactions"): Some steps cannot be compensated once they have run. A sent email or a completed bank transfer cannot be undone. The first such step in a saga is its pivot transaction: once it commits, the saga can no longer roll back and must run forward to completion. Place the pivot as late in the saga as possible, so that every step before it is compensatable and can still be undone if the pivot fails. Any step that comes after the pivot must be retriable, meaning it is guaranteed to succeed eventually.

Isolation and the "Dirty Read" Problem

Unlike ACID transactions, sagas provide no isolation between concurrent saga instances. While saga A is reserving inventory and charging payment, saga B can read the intermediate state (e.g., stock showing as reserved but payment not yet confirmed). This is the "dirty read" problem in distributed sagas.

Mitigation strategies:

Semantic locking: mark resources with an "in-progress" flag (e.g., status = PENDING_PAYMENT) that other sagas check before proceeding.
Commutative updates: design updates so that their order does not matter (e.g., incrementing or decrementing counters rather than setting absolute values).
Re-read values: before the final step, re-read critical fields and abort the saga if they have changed unexpectedly since the saga started.
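Two of these mitigations, semantic locking and re-reading values, can be combined in a short sketch. The `Order` class, the `version` counter, and both function names are assumptions for illustration. Only the `PENDING_PAYMENT` status comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    total_cents: int
    status: str = "PENDING_PAYMENT"  # semantic lock: a saga is still in progress
    version: int = 0                 # bumped on every write


def try_start_refund(order: Order) -> bool:
    """A second saga checks the semantic lock before touching the order."""
    if order.status == "PENDING_PAYMENT":
        return False  # another saga owns this order; back off or retry later
    order.status = "REFUND_IN_PROGRESS"
    order.version += 1
    return True


def confirm_order(order: Order, version_seen: int) -> bool:
    """Final step: re-read the order and abort if it changed since the saga began."""
    if order.version != version_seen or order.status != "PENDING_PAYMENT":
        return False
    order.status = "CONFIRMED"
    order.version += 1
    return True
```

In a real service the check and the update in each function would have to run inside one local database transaction (or as a conditional update). Otherwise two sagas could both pass the check before either one writes.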

When to Use the Saga Pattern

Any operation that spans two or more microservices with independent databases and a business requirement for consistency.
Long-running workflows (travel booking: flights + hotel + car; loan origination: credit check + score + approval + disbursement) where holding a distributed lock for the full duration is not acceptable.
When the individual steps are already idempotent or can be made so.

Sagas are a widely used consistency mechanism for microservice architectures, and variants of the pattern are common in e-commerce platforms, fintech systems, and ride-sharing services. Mastering the pattern, along with its isolation trade-offs and the discipline required for correct compensations, is one of the most practically valuable skills in distributed systems design.