Distributed Transactions & Two-Phase Commit
Distributed Transactions & Two-Phase Commit
A distributed transaction is an operation that must succeed or fail atomically across two or more independent services or databases. The classic example: a payment system that must (1) debit the customer account, (2) credit the merchant account, and (3) update an inventory record — all on separate databases. If step 2 succeeds but step 3 crashes, money has moved but stock is wrong. That is the core problem distributed transactions solve.
Why Local ACID Is Not Enough
Within a single database, the engine gives you ACID for free: a BEGIN … COMMIT either applies every write or rolls them all back. Across services, there is no shared transaction log. Each service can only see its own data store. Coordinating them requires a protocol that both participants can trust even when the network fails mid-way.
Two-Phase Commit (2PC)
Two-Phase Commit is the canonical solution. It introduces a neutral Coordinator (often the service that initiates the transaction) and one or more Participants (each owning a resource — database, queue, cache). The protocol runs in two phases:
- Phase 1 — Prepare (Voting): The coordinator sends a
PREPARErequest to every participant. Each participant writes the intended changes to a durable write-ahead log, locks the relevant rows, and repliesYES(ready to commit) orNO(abort). - Phase 2 — Commit or Abort: If all participants voted YES, the coordinator logs
COMMITand sendsCOMMITto everyone. If any participant voted NO or timed out, the coordinator logsABORTand sendsROLLBACK. Participants apply or undo their changes and release locks.
The Blocking Problem
2PC is a blocking protocol. Consider what happens if the coordinator crashes immediately after logging COMMIT but before sending the COMMIT message to participants. The participants are stuck: they have locked their rows and cannot decide whether to commit or abort on their own. They must wait until the coordinator recovers. During that window, those rows are unavailable — potentially for minutes or longer.
Real-World Usage: XA Transactions
The XA standard (defined by the Open Group) is the most common implementation of 2PC you will encounter. MySQL, PostgreSQL, Oracle, and IBM MQ all support XA. Java applications use it via javax.transaction.XAResource. The coordinator role is typically played by a transaction manager such as Atomikos, Narayana, or a JTA-compatible application server.
Example flow in a banking microservices context: the Payment Service acts as coordinator; it opens an XA transaction spanning the Accounts DB (PostgreSQL) and the Ledger DB (MySQL). Both databases perform their prepare, the coordinator commits, and both apply. From the user's perspective, the debit and credit appear simultaneously.
Performance Characteristics
A single 2PC round trip adds at minimum 2 × network RTT + 2 × disk fsync latency to every transaction. At 5 ms RTT and 1 ms fsync, that is roughly 12 ms of pure overhead before any application logic runs. For high-throughput systems (thousands of transactions per second), this compounds quickly. Locks held across the network further reduce concurrency. These costs explain why modern microservice architectures prefer eventual-consistency patterns (Sagas, covered in the next lesson) over 2PC for long-running or cross-service operations.
Three-Phase Commit (3PC) — a Brief Note
Three-Phase Commit (3PC) attempts to solve the blocking problem by inserting a pre-commit phase between prepare and commit. If the coordinator crashes after pre-commit, participants can safely make a decision without waiting. However, 3PC introduces a new failure mode under network partition (split-brain commit vs. abort) and is rarely used in practice. Most production systems stay with 2PC and invest in high-availability coordinators rather than adopting 3PC's added complexity.
Alternatives and When to Use Them
Because of 2PC's blocking and performance costs, the industry has converged on these alternatives for most microservice scenarios:
- Outbox Pattern + CDC: Write the event to an
outboxtable in the same local transaction as the business data. A Change-Data-Capture process (e.g., Debezium) tails the WAL and publishes the event. Atomic within one DB; eventual across services. - Sagas (Choreography / Orchestration): Break the distributed transaction into a sequence of local transactions, each publishing an event. Compensating transactions handle rollback. Covered in Lesson 9.
- Google Spanner / CockroachDB TrueTime: Distributed databases that implement their own distributed commit protocol internally, hiding the complexity from the application and achieving serializable isolation at global scale.
Key Takeaways
- 2PC coordinates an atomic commit across multiple independent participants using a Prepare → Commit/Abort protocol.
- The coordinator is a single point of failure; its log must be durable so it can retransmit decisions after a crash.
- Participants hold locks between Phase 1 and Phase 2, creating a blocking window proportional to coordinator recovery time.
- XA is the standard implementation; it is well-supported in relational databases and enterprise middleware.
- For most modern microservice architectures, the Outbox Pattern and Sagas are preferable — they trade strict atomicity for availability and scalability.