The Saga Pattern
In the previous lesson you learned that Two-Phase Commit (2PC) provides strong consistency across distributed services — but at a steep cost: all participants must be locked and reachable for the duration of the transaction. In practice, locking an order record, an inventory record, and a payment record across three independent microservices for even 200 ms creates serious contention at scale and makes the system brittle whenever any single service is slow or temporarily unavailable.
The Saga pattern is the practical alternative. Instead of a single atomic distributed transaction, a saga breaks the business operation into a sequence of local transactions, each of which updates one service and publishes an event or sends a command to trigger the next step. If any step fails, the saga executes a series of compensating transactions — actions that semantically undo the work already committed — rolling the system back to a consistent state without ever holding distributed locks.
A Concrete Example: E-commerce Order Placement
Consider placing an order that spans four independent services: Order Service, Inventory Service, Payment Service, and Notification Service. The happy path requires all four steps to succeed; a failure in any step must undo the preceding committed work.
The forward steps are:
- Order Service: create an order record with status PENDING.
- Inventory Service: reserve the requested stock (subtract from available quantity).
- Payment Service: charge the customer's payment method.
- Notification Service: send a confirmation email; mark order CONFIRMED.
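Each forward step has a corresponding compensating transaction. They are listed here in the same order as the steps they undo; on failure, the saga runs the compensations for the steps that already committed, in reverse order:
- Mark order as CANCELLED.
- Release the reserved inventory (add quantity back).
- Issue a refund to the payment method.
- Send a cancellation email.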
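The forward steps and compensations above can be sketched as a small in-process executor. This is a minimal illustration, not a production implementation: the `Step` and `run_saga` names, the shared `ctx` dictionary, and the simulated payment failure are all assumptions made for the example, and real participants would be separate services reached over the network.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # forward local transaction
    compensate: Callable[[dict], None]  # semantic undo of that transaction


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate committed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["log"].append(f"{step.name} failed: {exc}")
            # Only steps that already committed are compensated.
            for done in reversed(completed):
                done.compensate(ctx)
            return False
    return True


def record(message: str) -> Callable[[dict], None]:
    """Hypothetical stand-in for a real service call: just logs the effect."""
    return lambda ctx: ctx["log"].append(message)


def declined_payment(ctx: dict) -> None:
    # Simulated failure in the third step.
    raise RuntimeError("card declined")


steps = [
    Step("order", record("order PENDING"), record("order CANCELLED")),
    Step("inventory", record("stock reserved"), record("stock released")),
    Step("payment", declined_payment, record("payment refunded")),
    Step("notification", record("confirmation sent"), record("cancellation sent")),
]

ctx = {"log": []}
succeeded = run_saga(steps, ctx)
```

Because the payment step fails before it commits, only the inventory and order compensations run, in that order; no refund or cancellation email is issued in this sketch.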
Choreography vs Orchestration
There are two fundamentally different ways to coordinate the steps of a saga:
Choreography (event-driven): each service listens for events on a message bus and reacts independently. The Order Service publishes OrderCreated; the Inventory Service consumes it and publishes StockReserved; the Payment Service consumes that event in turn, and so on. No central coordinator exists; the saga emerges from the interactions.
Orchestration (command-driven): a dedicated Saga Orchestrator (often implemented as a state machine) explicitly commands each participant in sequence, waits for their reply, and decides what to do next — including issuing compensations if a step fails. The orchestrator is the single source of truth about the saga's current state.
Choreography: Trade-offs
- Pros: Loose coupling — services do not know about each other; easy to add new steps by subscribing to existing events; no single point of failure for coordination.
- Cons: Hard to understand the overall flow — it is implicit and spread across many services; difficult to debug when something goes wrong mid-saga; compensations must also be event-driven, making the interaction web complex.
Choreography works well for simple, linear sagas with 2–4 steps where the event contract is already well-defined (e.g., a webhook fan-out).
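A choreographed flow might look roughly like the following sketch. The in-memory `EventBus` is a hypothetical stand-in for a real message broker, and the `StockRejected` and `PaymentCharged` event names are assumptions added for the example (only OrderCreated and StockReserved come from the text above). Payment is assumed to always succeed to keep the sketch short.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.history = []  # event types in delivery order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def run(self):
        # Deliver queued events until no handler publishes anything new.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.history.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


bus = EventBus()
stock = {"widget": 1}
orders = {}


def inventory_on_order_created(evt):
    # Inventory Service: reserve stock or report that it cannot.
    if stock.get(evt["sku"], 0) >= evt["qty"]:
        stock[evt["sku"]] -= evt["qty"]
        bus.publish("StockReserved", evt)
    else:
        bus.publish("StockRejected", evt)


def payment_on_stock_reserved(evt):
    # Payment Service: assumed to succeed in this sketch.
    bus.publish("PaymentCharged", evt)


def order_on_payment_charged(evt):
    orders[evt["order_id"]] = "CONFIRMED"


def order_on_stock_rejected(evt):
    # Compensation is event-driven too: the Order Service reacts by cancelling.
    orders[evt["order_id"]] = "CANCELLED"


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("StockReserved", payment_on_stock_reserved)
bus.subscribe("PaymentCharged", order_on_payment_charged)
bus.subscribe("StockRejected", order_on_stock_rejected)


def place_order(order_id, sku, qty):
    orders[order_id] = "PENDING"
    bus.publish("OrderCreated", {"order_id": order_id, "sku": sku, "qty": qty})
    bus.run()


place_order("o1", "widget", 1)  # stock available
place_order("o2", "widget", 1)  # stock exhausted
```

Note that no function in the sketch describes the whole flow; it can only be reconstructed by reading every subscription, which is the debugging cost listed under the cons above.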
Orchestration: Trade-offs
- Pros: The saga flow is explicit and centralised — easy to read, monitor, and debug; rollback logic is co-located with the forward logic; the orchestrator's state machine is a natural audit log.
- Cons: Introduces a new component (the orchestrator) that must be highly available; risks becoming a "god service" if poorly scoped; can create tighter coupling through direct command channels.
Orchestration is a common choice in production microservice systems with complex, multi-step, long-running flows. Tools such as Apache Camel, Conductor (Netflix), Temporal, and AWS Step Functions support orchestrated workflows as a first-class concept.
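To make the state-machine idea concrete, here is a minimal orchestrator sketch that does not use any of the frameworks named above. The state names, the `TRANSITIONS` table, and the participant callables are assumptions for illustration; a real orchestrator would persist its state and talk to participants over a command channel rather than calling functions directly.

```python
from enum import Enum


class State(Enum):
    RESERVING_STOCK = "RESERVING_STOCK"
    CHARGING_PAYMENT = "CHARGING_PAYMENT"
    RELEASING_STOCK = "RELEASING_STOCK"  # compensation in progress
    CONFIRMED = "CONFIRMED"
    CANCELLED = "CANCELLED"


# state -> (command to send, next state on success, next state on failure)
TRANSITIONS = {
    State.RESERVING_STOCK: ("reserve_stock", State.CHARGING_PAYMENT, State.CANCELLED),
    State.CHARGING_PAYMENT: ("charge_payment", State.CONFIRMED, State.RELEASING_STOCK),
    # A failed compensation is retried by staying in the same state.
    State.RELEASING_STOCK: ("release_stock", State.CANCELLED, State.RELEASING_STOCK),
}
TERMINAL = {State.CONFIRMED, State.CANCELLED}


class OrderSagaOrchestrator:
    def __init__(self, participants):
        # participants: command name -> callable returning True on success
        self.participants = participants
        self.state = State.RESERVING_STOCK
        self.audit_log = [self.state]

    def step(self):
        command, on_success, on_failure = TRANSITIONS[self.state]
        succeeded = self.participants[command]()
        self.state = on_success if succeeded else on_failure
        self.audit_log.append(self.state)

    def run(self, max_steps=10):
        # max_steps bounds retries so the sketch always terminates.
        while self.state not in TERMINAL and max_steps > 0:
            self.step()
            max_steps -= 1
        return self.state


happy = OrderSagaOrchestrator({
    "reserve_stock": lambda: True,
    "charge_payment": lambda: True,
    "release_stock": lambda: True,
})
happy_result = happy.run()

declined = OrderSagaOrchestrator({
    "reserve_stock": lambda: True,
    "charge_payment": lambda: False,  # simulated payment failure
    "release_stock": lambda: True,
})
declined_result = declined.run()
```

The `audit_log` list is the sketch's version of the "natural audit log" mentioned above: every state the saga passed through is recorded in one place.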
Designing Compensating Transactions
Compensating transactions are not rollbacks in the database sense — the original transaction already committed. A compensation is a new, forward-moving transaction that semantically reverses the business effect.
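One way to picture the difference is an append-only ledger, sketched below under assumed names (`PaymentLedger`, `LedgerEntry`, amounts in cents). The refund does not delete the charge; it adds a new entry that cancels its business effect, so both remain visible in the history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    saga_id: str
    kind: str          # "CHARGE" or "REFUND"
    amount_cents: int  # positive for charges, negative for refunds


class PaymentLedger:
    """Hypothetical append-only ledger: entries are never edited or deleted."""

    def __init__(self):
        self.entries = []

    def charge(self, saga_id, amount_cents):
        self.entries.append(LedgerEntry(saga_id, "CHARGE", amount_cents))

    def balance(self, saga_id):
        return sum(e.amount_cents for e in self.entries if e.saga_id == saga_id)

    def refund(self, saga_id):
        # Compensation: a new forward transaction that offsets the net charge.
        outstanding = self.balance(saga_id)
        if outstanding > 0:
            self.entries.append(LedgerEntry(saga_id, "REFUND", -outstanding))


ledger = PaymentLedger()
ledger.charge("saga-1", 2500)
ledger.refund("saga-1")
kinds = [entry.kind for entry in ledger.entries]
```

After the refund the net balance for the saga is zero, yet the ledger still holds two entries, which is exactly what distinguishes a compensation from a database rollback.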
Three properties every compensation must satisfy:
- Idempotent — the compensation can be executed multiple times safely. If the network retries delivering the compensation command, the second execution must be a no-op (e.g., "if stock is already released, do nothing").
- Commutative where possible — compensations should not depend on a specific ordering of concurrent events, because message delivery order is not guaranteed in distributed systems.
- Complete: every forward step that mutates state must have a well-defined compensation. If you cannot define a compensation for a step, it cannot sit in the middle of a saga: either run it last, after every step that can fail, or handle it outside the saga (consider using 2PC or a synchronous call instead).
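The idempotency requirement can be illustrated with a stock-release compensation keyed by saga ID. This is a sketch under assumptions: the `InventoryService` class, the `saga_id` key, and the in-memory dictionaries are hypothetical, and a real service would keep this bookkeeping in its own database inside a local transaction.

```python
class InventoryService:
    def __init__(self, available):
        self.available = dict(available)  # sku -> quantity on hand
        self.reservations = {}            # saga_id -> (sku, qty) currently held
        self.released = set()             # saga_ids whose compensation has run

    def reserve(self, saga_id, sku, qty):
        if saga_id in self.released:
            # Compensation already ran; ignore a late or re-delivered reserve.
            return False
        if saga_id in self.reservations:
            return True  # duplicate command: already reserved, do nothing
        if self.available.get(sku, 0) < qty:
            return False
        self.available[sku] -= qty
        self.reservations[saga_id] = (sku, qty)
        return True

    def release(self, saga_id):
        """Compensation: safe to call any number of times."""
        self.released.add(saga_id)
        held = self.reservations.pop(saga_id, None)
        if held is None:
            return  # already released (or never reserved): no-op
        sku, qty = held
        self.available[sku] += qty


inventory = InventoryService({"widget": 5})
inventory.reserve("saga-1", "widget", 2)
inventory.release("saga-1")
inventory.release("saga-1")  # re-delivered compensation command
late_reserve = inventory.reserve("saga-1", "widget", 2)
```

Recording released saga IDs also makes the pair tolerant of reordering: if the compensation arrives before a delayed forward command, the late reserve is rejected instead of leaking stock.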
Isolation and the "Dirty Read" Problem
Unlike ACID transactions, sagas provide no isolation between concurrent saga instances. While saga A is reserving inventory and charging payment, saga B can read the intermediate state (e.g., stock showing as reserved but payment not yet confirmed). This is the "dirty read" problem in distributed sagas.
Mitigation strategies:
- Semantic locking: mark resources with an "in-progress" flag (e.g., status = PENDING_PAYMENT) that other sagas check before proceeding.
- Commutative updates: design updates so that order does not matter (e.g., incrementing/decrementing counters rather than setting absolute values).
- Re-read values: before the final step, re-read critical fields and abort the saga if they have changed unexpectedly since the saga started.
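Two of these mitigations, semantic locking and re-reading values, can be sketched together. The `OrderStore` class, its `version` column, and the `StaleReadError` exception are assumptions for illustration; the version counter is one common way to detect that a row changed since it was read, not the only one.

```python
class StaleReadError(Exception):
    """Raised when a row changed after the saga last read it."""


IN_PROGRESS = {"PENDING_PAYMENT"}  # statuses that act as a semantic lock


class OrderStore:
    def __init__(self):
        self.rows = {}  # order_id -> {"status": str, "version": int}

    def create(self, order_id):
        self.rows[order_id] = {"status": "PENDING_PAYMENT", "version": 1}

    def read(self, order_id):
        return dict(self.rows[order_id])  # copy, so callers hold a snapshot

    def update(self, order_id, status, expected_version):
        row = self.rows[order_id]
        if row["version"] != expected_version:
            raise StaleReadError(order_id)
        row["status"] = status
        row["version"] += 1


def can_start_saga(store, order_id):
    """Semantic lock check: other sagas back off while one is in progress."""
    return store.read(order_id)["status"] not in IN_PROGRESS


store = OrderStore()
store.create("o1")
snapshot = store.read("o1")  # the saga remembers what it saw at the start

other_saga_allowed = can_start_saga(store, "o1")

# Simulated out-of-band change, e.g. the customer cancels mid-saga.
store.update("o1", "CANCELLED", expected_version=1)

try:
    # Final step: confirm only if nothing changed since the snapshot.
    store.update("o1", "CONFIRMED", expected_version=snapshot["version"])
    aborted = False
except StaleReadError:
    aborted = True  # the saga should now run its compensations
```

In this sketch the second saga is told to wait by the PENDING_PAYMENT status, and the first saga aborts at its final step because the version it read at the start no longer matches.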
When to Use the Saga Pattern
- Any operation that spans two or more microservices with independent databases and a business requirement for consistency.
- Long-running workflows (travel booking: flights + hotel + car; loan origination: credit check + score + approval + disbursement) where holding a distributed lock for the full duration is not acceptable.
- When the individual steps are already idempotent or can be made so.
Sagas are a widely used consistency mechanism for microservice architectures, and variants of the pattern appear in many e-commerce, fintech, and ride-sharing systems. Mastering it, along with its isolation trade-offs and the discipline that correct compensations require, is one of the most practically valuable skills in distributed systems design.