Tech Insights — Two-phase Commit Protocol for Distributed Transactions
In a distributed system that reads and writes data on multiple nodes, if you still need to ensure the atomicity, consistency, isolation, durability (ACID) features, you must implement distributed transactions. The key to this implementation is an appropriate commit protocol. At present, the two-phase commit protocol (2PC) is the most concise and most widely used protocol for this purpose.
Key Components for Implementing Distributed Transactions
Standalone systems implement local transactions through transaction managers (TMs). Distributed systems coordinate TMs of multiple nodes to collectively commit success or failure. Therefore, transaction coordinators (TCs) are required. A distributed transaction system mainly consists of these two subsystems, which are also called the participant and the coordinator based on their roles in transaction execution.
A local TM implements functions such as concurrency control of local transactions and recovery from exceptions. A TC enables transactions, divides transactions into multiple sub-transactions and distributes them to corresponding nodes for execution, and coordinates transactions until completion (collectively commits success or failure). You can either implement TM and TC in the same process or deploy them on different nodes.
Typical Two-phase Commit Protocol
The process of two-phase commit is simple. When a distributed transaction T is completed, that is, all the nodes that have executed the transaction inform the TC of the execution completion, the TC starts the process of two-phase commit.
Phase 1 — Prepare:
- The TC writes a local
<Prepare T>log and persists it. The TC sends a "Prepare T" message to all participants.
- Each participant TM receives the “Prepare T” message and decides whether to commit the transaction based on its own situation:
- If a TM decides to commit the transaction, it writes a
<Ready T>log and persists it, and then sends a "Ready T" message to the TC.
- If a TM decides not to commit the transaction, it writes an
<Abort T>log and persists it, and sends an "Abort T" message to the TC. Then, the TM enters the transaction abortion process locally.
Phase 2 — Commit:
1. When the TC has received responses from all nodes or the waiting timer times out, the TC decides whether to commit or abort the transaction.
- If all participants respond with a “Ready T” message, the TC writes a
<Commit T>log and persists it, and then sends a "Commit T" message to all participants.
- If the TC receives an “Abort T” response from at least one participant, or if any participant fails to respond within the timeout period, the TC writes a
<Abort T>log, and then sends an "Abort T" message to all participants.
2. After the participants receive the message from the TC, they write
<Commit T> or
<Abort T> logs and persist them.
The 2PC protocol can ensure a key point in the execution of distributed transactions: A participant can decide whether to abort the transaction at any time before it sends a “Ready T” message to the TC. Once this message is sent, the transaction enters the ready state, in which commit and abortion are completely controlled by the TC. When a participant sends a “Ready T” message, it essentially sends a formal and irreversible commitment to the TC. To ensure this commitment, the participant must persist all the necessary information before it sends a “Ready T” message. Otherwise, if the participant crashes after it sends a “Ready T” message, it may be unable to keep the preceding commitment after a restart. In Phase 2, when the coordinator has written a
<Commit T> or
<Abort T> log, the result of the transaction is decided, that is, it will not change anymore.
To optimize the performance of 2PC, it is crucial to reduce the persistence of critical paths and the number of remote procedure calls (RPCs). An optimization method for typical 2PC is as follows:
The TC is stateless and does not persist logs. However, to facilitate transaction state recovery after a restart that follows a crash, the TC must send a list of the participants in the transaction to each participant and persist this list. If this is the case, even if the TC goes down, each participant can easily communicate with other participants to query the transaction state. This method means that each participant performs the task of querying the transaction state during the downtime of the TC.
When all participants have prepared, the transaction will be successfully committed. Therefore, to reduce the delay in committing, the TC can return a success message to the clients on receipt of “Ready” from all participants. However, read requests may need to wait until the commit is completed, therefore increasing the delay of read requests. On the other hand, if the TC returns a success message after it confirms that all participants have committed, there will be a longer delay in committing and shorter delay of read requests.
Exception Handling in 2PC
The normal 2PC process is simple. However, various exceptions in distributed systems must be considered, such as node failures and network partitioning.
1. If the TC detects a participant failure:
- If a participant fails before it sends a “Ready T” message, the TC considers that the node has aborted the transaction and starts the abortion process.
- If a participant fails after it sends a “Ready T” message, it indicates that the participant has persisted the local transaction. The TC ignores the participant failure and proceeds with the transaction process.
2. If a participant fails during transaction committing (TC), the local transaction status is determined based on the content of participant logs during the recovery process.
- If the logs include a
<Commit T>log, it indicates that the transaction has been committed, and the participant runs the REDO(T) command.
- If the logs include an
<Abort T>log, it indicates that the transaction has failed, and the participant runs the UNDO(T) command.
- If the logs include a
<Ready T>log, the participant P communicates with other nodes to query the current transaction state.
- If the TC is normal, it informs the participant P that the transaction has been committed or aborted, and the participant runs the REDO(T) or UNDO(T) command accordingly.
- If the TC is abnormal, the participant P communicates with other participants to query the transaction state.
- If other participants receive the message and know whether the transaction is committed or aborted, they must reply to the participant P with the transaction state.
- If no participant knows the transaction state due to reasons such as destroyed context or its own pending status, the transaction can neither be committed nor aborted. The participant P must regularly communicate with other nodes to query the transaction state until it receives an answer. This is one of the worst scenarios in the 2PC protocol.
- If the logs do not contain any of the preceding logs, it indicates that the participant fails before it sends a “Ready T” message to the TC. If the TC does not receive any response from the participant, it aborts the transaction due to timeout. If this is the case, the participant must abort the transaction during the recovery process.
3. If the TC fails during the transaction commit process, each participant needs to decide the local behavior based on the global transaction state by communicating with other participants. A decision on the transaction status has been made:
- If one or more participants (including the
<Commit T>logs) have committed the transaction, it indicates that the transaction T must be committed.
- If one or more participants (including the
<Abort T>logs) have aborted the transaction, it indicates that the transaction T must be aborted.
A decision on the transaction status has not been made:
- If one or more participants (not including the
<Ready T>logs) are not in the ready state, it indicates that no global consensus has been reached on whether to commit the transaction or not. There are two options: (1) The participants wait for the TC to recover. (2) The participants abort the transaction. The latter option is selected in most cases to decrease the duration of resource occupation.
- If all participants are in the ready state without
<Abort T>logs, no participant knows the current transaction state, so the participants must wait for the TC to recover. In fact, even if there are
<Abort T>logs, log query is time-consuming, and log collection is another consideration. Again, this is one of the worst scenarios in the 2PC protocol.
If all participants are in the ready state and are waiting for the next command from the TC, and an exception occurs on the TC at this point, the participants will keep occupying system resources. If the system has implemented lock-based concurrency control, the participants will also keep holding the locks, and therefore keep other transactions waiting. If this issue exist for a long time, it will have a significant impact on the system. Therefore, the biggest problem for the 2PC protocol is in this scenario: If the TC fails, transactions may be blocked, and final state of the pending transactions cannot be decided until the TC recovers. Meanwhile, if a participant restarts after a crash and plays back these pending transactions, the recovery process will also be blocked due to waiting.
Solutions for Mitigating the 2PC Blocking Problem
The three-phase commit protocol (3PC) is an extension of the 2PC protocol to solve the blocking problem, but it has also introduced other problems. It addresses the blocking problem by introducing a timeout mechanism: If a participant has sent a PreCommit request but does not receive the final DoCommit request from the TC, the participant must wait for the transaction to be automatically committed upon timeout. Obviously, this will cause consistency problems. For example, if the TC goes down when it receives a PreCommit failure from a participant and intends to send an abort request to other participants, the corresponding distributed transaction is supposed to fail. However, some participants may commit it due to timeout.
To address this problem, an additional phase called “CanCommit” has been introduced to 3PC, which is the first phase. In this phase, the TC asks all participants if they can commit, and the participants that are in the normal state respond with a “CanCommit” message. However, these participants do not occupy any system resources at this point. If the TC receives the “CanCommit” message from all participants, it considers all participants to be normal and supposes the later commit to be successful. However, there is still a small probability of failure: After a participant fails to execute operations in the PreCommit phase, both the TC and the participant go down, but other participants commit the transaction upon timeout, resulting in inconsistency.
Therefore, another optimization has been made in 3PC: If the TC goes down, the system quickly selects a new TC to resume the process and try to ensure that no participant will commit the transaction upon timeout. However, in case of exceptions such as network partitioning, the new TC is unable to reach the participants, so consistency problems may still occur.
3PC improves the availability at the cost of reduced consistency and increased network overhead, which is unacceptable for many online transaction processing (OLTP) systems.
However, with high availability of the TC, the blockage time can be significantly reduced. There are high-availability solutions based on consistency protocols such as Paxos or Raft. These solutions allow multiple nodes to reach consensus on whether to commit or abort a transaction before they inform the participants. When an exception occurs on the TC, the system will quickly select a new coordinator to proceed the transaction until completion.