42 E Commerce System Distributed Transaction Adjustment

42 E-commerce system distributed transaction adjustment #

Hello, I’m Liu Chao.

Today’s topic will also start with a case study. Our team once encountered a very serious online incident, where the system occasionally generated abnormal alerts after a DBA completed an online patch for a single database. Our development engineers quickly identified the database abnormality.

The specific situation is this: when a player purchases an item, an exception occurs when deducting the virtual currency. In normal circumstances, after such an exception occurs, the entire purchase operation should be rolled back. However, the severity of this exception is that the virtual currency was not deducted even after the player successfully purchased the item.

The root cause is as follows: the purchased item was updated in the game database, while the virtual currency is stored in the user account center database. In a single purchase, there is a need to simultaneously perform operations on two databases, which is a form of distributed transaction. However, our engineers did not ensure transaction consistency when performing the operations of granting the item and deducting the balance. In other words, when the deduction of virtual currency fails, the purchased game item should be rolled back.

From this case study, I hope you realize the importance of distributed transactions.

Nowadays, most companies have implemented microservices for their services. First, it is due to business needs to decouple services, and second, to reduce mutual impacts between different services.

The same goes for e-commerce systems, where most companies split their e-commerce systems into different service modules, such as product module, order module, inventory module, and so on. In fact, breaking down services is a double-edged sword. It can bring advantages in terms of development, performance, and operations, but at the same time, it can also increase the complexity of business development. The most prominent issue among them is distributed transactions.

Usually, there are two types of deployment for service architectures that involve distributed transactions: different databases for the same service, and different databases for different services. Let’s take a mall as an example and illustrate these two types of deployment:

Typically, we deploy based on the second architecture. So, how can we implement distributed transactions related to order submission in this type of service architecture?

Distributed Transaction Solutions #

We have mentioned before that in the case of a single database, data transaction operations have the four characteristics of ACID. However, if multiple databases are operated within one transaction, it is not possible to guarantee consistency using database transactions.

In other words, when two databases operate on data, it is possible for one database operation to succeed while the other fails. We cannot roll back both data operations using a single database transaction.

Distributed transactions are designed to solve the problem of inconsistent data operations on different nodes within the same transaction. When a transaction operates on multiple services or database nodes, either all requests succeed or all requests fail and roll back. In general, there are multiple ways to implement distributed transactions, such as the two-phase commit (2PC) implemented by the XA protocol, the three-phase commit (3PC), and the TCC compensatory transaction.

Before understanding 2PC and 3PC, it is necessary for us to first understand the XA protocol. XA protocol is a distributed transaction processing specification proposed by the X/Open organization. Currently, only the InnoDB storage engine in MySQL supports the XA protocol.

1. XA Specification #

Before the XA specification, there was a DTP model that specified the model design of distributed transactions.

The DTP specification mainly includes three parts: AP, RM, and TM. AP is the application program, which is where the transaction starts and ends. RM is the resource manager, responsible for managing the connection data source of each database. TM is the transaction manager, responsible for the global management of transactions, including the lifecycle management of transactions and coordination of resource allocation.

XA specifies the communication interfaces between TM and RM, forming a bidirectional bridge between TM and multiple RM, thus ensuring the four characteristics of ACID under multiple database resources.

It is worth noting that JTA is a set of Java transaction programming interfaces based on the XA specification, which is a two-phase commit transaction. We can briefly understand the multi-datasource transaction submission implemented by JTA through the source code.

2. Two-Phase Commit and Three-Phase Commit #

Distributed transactions implemented by the XA specification are two-phase commit transactions, meaning that transactions are committed in two phases.

In the first phase, the application program sends a transaction request to the transaction manager (TM), and the transaction manager then sends a transaction preparation request (Prepare) to each participating resource manager (RM). At this time, these resource managers will open a local database transaction and begin executing the database transaction, but they will not immediately commit the transaction after execution. Instead, they will return a ready or not ready status to the transaction manager. If all participating nodes return a status, it will proceed to the second phase.

In the second phase, if all the resource managers return a ready status, the transaction manager will send commit notifications to each resource manager, and the resource managers will complete the local database transaction commit and return the commit result to the transaction manager.

In the second phase, if any resource manager returns a not ready status, the transaction manager will send rollback notifications to all resource managers, and the resource managers will roll back the local database transaction, release resources, and return the result notification.

However, in practice, two-phase commit transactions also have some flaws.

First, we can see that the resource manager nodes are blocked throughout the process. The transaction manager will not send a notification for global transaction submission until all nodes are ready. If this process takes a long time, many nodes will occupy resources for a long time, thereby affecting the performance of the entire system.

If a resource manager crashes, a situation where it is continuously blocked waiting for notification will occur. This problem can be solved by setting a transaction timeout.

Second, there is still the potential for data inconsistency. For example, when notifying global transaction submission at the end, due to network failures, some nodes may not receive the notification. Since these nodes have not committed the transaction, it will lead to data inconsistency.

The emergence of three-phase commit transactions (3PC) is to reduce the occurrence of such problems.

3PC divides the preparation phase of 2PC into preparation phase and pre-processing phase. In the first phase, each resource node is only asked if it can execute the transaction. In the second phase, all nodes must feedback that they can execute the transaction before the transaction operations begin. Finally, the commit or rollback operations are performed in the third phase. Both the transaction manager and the resource manager introduce a timeout mechanism. If a resource node cannot receive the commit or rollback request from the resource manager in the third phase after the timeout, it will continue to commit the transaction.

Therefore, 3PC can avoid long blocking issues caused by the transaction manager crashing through the timeout mechanism. However, it still cannot solve the problem of not being able to notify some nodes due to network failures during the final global transaction submission, especially rollback notifications. This will result in the transaction waiting to timeout and then performing a default commit.

3. Transaction Compensation Mechanism (TCC) #

The transaction submission based on the XA specification mentioned above has obvious performance issues such as low performance and low throughput due to blocking. Therefore, using such a transaction in a rush-buying activity is difficult to meet the system’s concurrency performance requirements.

In addition to performance issues, JTA can only solve the problem of distributed transactions involving multiple data sources within the same service. In a microservices architecture, when the same transaction operation connects to different data sources in different services, it may encounter difficulties in submitting database operations.

TCC, on the other hand, is a distributed transaction solution designed to address these issues. TCC implements a flexible distributed transaction using eventual consistency. Unlike the two-phase transactions implemented based on the XA specification, TCC is implemented as a two-phase transaction at the service layer.

TCC consists of three phases, namely Try, Confirm, and Cancel.

Try phase: This phase attempts to execute the business logic by calling the Try method in each service, which mainly includes reserving resources.
Confirm phase: This phase confirms the successful execution of the Try methods and calls the Confirm methods in each service through the transaction manager (TM). This phase is the submission phase.
Cancel phase: If any of the Try methods fails during the Try phase, for example, reservation of resources fails or an exception occurs, the TM triggers the Cancel methods in each service to roll back the global transaction and cancel the business logic execution.

The above execution only ensures the submission and rollback of the Try phase when it succeeds or fails. You might wonder how TCC handles exceptions in the Confirm and Cancel phases. In this case, TCC keeps retrying the failed Confirm or Cancel methods until they succeed.

However, TCC compensatory transactions have a significant drawback, which is the high degree of intrusion into the business logic.

First of all, we need to consider reserving resources when designing the business logic. Then, we need to write a large amount of business code, such as the Try, Confirm, and Cancel methods. Finally, we need to consider idempotence for each method. The implementation and maintenance cost of this type of transaction is very high. However, overall, this is currently the most commonly used distributed transaction solution.

4. Non-intrusive Solution - Seata (Fescar) #

Seata is a distributed transaction solution that Alibaba open-sourced last year. It has gained more than ten thousand stars on GitHub in just over a year, indicating its high popularity.

Seata’s basic modeling and Distributed Transaction Processing (DTP) model are similar, but the former further subdivides the transaction manager and extracts a transaction coordinator (TC). The TC mainly maintains the running status of global transactions and coordinates and drives the submission or rollback of global transactions. The transaction manager (TM) is responsible for starting a global transaction and making final decisions to initiate global commit or global rollback. The following diagram illustrates this:

According to the explanation on GitHub, the entire transaction process is as follows:

TM applies to TC to start a global transaction, which is created successfully and assigned a globally unique XID (transaction identifier).
The XID propagates in the context of the microservice invocation chain.
RM registers a branch transaction with TC to make it subject to the governance of the global transaction corresponding to the XID.
TM initiates the global commit or rollback decision for the XID.
TC schedules all branch transactions under the XID to complete submission or rollback requests.

The biggest difference between Seata and other distributed solutions is that Seata commits various transaction operations as early as the first phase. Seata believes that in a normal business scenario, there is a high probability that the transaction submissions of various services will succeed. This transaction commit operation can save the time of holding locks in the two phases, thereby improving overall execution efficiency.

But if the transaction is already committed in the first phase, what about the rollback?

Seata elevates RM to the service layer. It uses a JDBC data source proxy to parse SQL statements and organizes the before and after data of business data updates into rollback logs. It uses the atomicity, consistency, isolation, and durability (ACID) properties of the local transaction to commit both the update of business data and the write of rollback logs in the same local transaction.

If RM decides to roll back globally, it notifies the RM to perform rollback operations. It retrieves the corresponding rollback log records based on the XID and generates reverse-update SQL statements to perform rollback operations.

With the above approach, we can ensure the atomicity and consistency of a transaction. But how is isolation ensured?

Seata uses a global write-exclusive lock maintained by the transaction coordinator to ensure write isolation between transactions, while the default read-write isolation level is uncommitted read.

Summary #

In the case of operating multiple data sources with different databases in the same service, we can use distributed transactions implemented based on the XA specification. In Spring, there is a mature JTA framework that implements the XA specification for two-phase transaction commit. In fact, besides the severe blocking issues in terms of performance, two-phase transactions can also lead to data inconsistency. We should carefully consider the use of two-phase transaction commit.

In the case of cross-service distributed transactions, we can consider using distributed transactions implemented based on the TCC (Try-Confirm-Cancel) principle. The commonly used middleware for this is TCC-Transaction. TCC is also based on the two-phase transaction commit principle, but the second phase of transaction commit in TCC is implemented in the service layer. Although the TCC approach improves the overall performance of distributed transactions, it also brings a significant amount of work to the business layer. It has a strong impact on the application service, but this is the distributed transaction solution adopted by most companies at present.

Seata is an efficient distributed transaction solution designed to solve performance and invasiveness issues caused by distribution. However, the stability of Seata still needs to be verified. For example, when TC notifies RM (Resource Manager) to start committing a transaction, if the connection between TC and RM is disconnected, or the connection between RM and the database is disconnected, it cannot guarantee the consistency of the transaction.

Thought Question #

In the first phase, Seata has already submitted the transaction. If an exception occurs in the second phase and needs to roll back to the Before snapshot, what if another thread has updated the data and the business has completed? Then, won’t the restored snapshot be dirty data? However, in fact, Seata will not encounter this situation. Do you know how it achieves this?

Seata ensures data consistency even when an exception occurs and a rollback is needed during the second phase. It achieves this through the following mechanisms:

Distributed Locking: Seata uses distributed locks to synchronize access to shared resources across multiple threads and processes. When a transaction is in progress, Seata acquires locks to prevent other threads from modifying the data, ensuring that the data remains consistent throughout the transaction.
Isolation Levels: Seata provides isolation levels for transactions, such as READ_COMMITTED and REPEATABLE_READ. These isolation levels define how concurrent transactions access and modify data. By using the appropriate isolation level, Seata prevents dirty reads and ensures that only committed data is visible to other transactions.
Transaction Log: Seata logs all the changes made within a transaction in a transaction log. If an exception occurs and a rollback is needed, Seata uses the transaction log to restore the data to the Before snapshot, ensuring that any changes made after the snapshot are discarded.

By combining distributed locking, appropriate isolation levels, and transaction logs, Seata ensures data consistency and prevents the restored snapshot from being dirty data.