19 Distributed Transactions - The Ultimate Showdown Between Spanner and Calvin #

In the previous lecture, we introduced the most important concept in distributed transactions, the atomic commit, and discussed the two-phase commit, three-phase commit, and Percolator models.

In this lecture, I will introduce the two most famous distributed transaction models in the industry; the heated debates between their authors and followers have made these models even more legendary. Let’s see which one ultimately wins.

First, let me introduce the two “contestants”: Spanner and Calvin. Both are backed by widely cited papers and rest on solid theoretical foundations. Let’s start with Spanner.

Spanner and its Followers #

Spanner originated from a Google paper and eventually became a service on Google Cloud. In simple terms, Spanner is an implementation of two-phase commit. In the previous lecture, I introduced four failure scenarios of two-phase commit, one of which is an unresponsive participant during the prepare phase, which reduces the availability of the transaction. Spanner solves this problem by using a consensus algorithm to keep each shard highly available, thereby improving the overall availability of the transaction.

The overall architecture of Spanner is quite complex and covers a lot of ground, but its core consists of two parts: TrueTime and the Paxos Group. The debate between the two models mainly revolves around one of these parts.

TrueTime #

I mentioned this in Module 3, “13 | Overview: What problems do distributed systems solve?”. There are two ways for a distributed system to obtain time: physical time and logical time. Because physical time is unreliable, most distributed systems use logical time. A logical timestamp is typically generated by a single node, which is efficient enough; however, for a globally distributed system, this design has its limitations.

TrueTime, by contrast, is a fusion of logical and physical time, generated from atomic clocks combined with local time in each data center (IDC). Unlike a traditional single point in time, TrueTime returns a time interval; a data operation may have occurred anywhere within this interval, so the data state inside the interval is uncertain. The system must wait out this interval to obtain a definite state. The wait is usually short, and multiple operations can wait in parallel, so overall throughput is usually not affected.
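The uncertainty interval and the “wait it out” rule (commit-wait) can be sketched as follows. This is a toy model, not Google’s actual API: the `EPSILON` bound, the class, and the function names are all my own assumptions.

```python
import time
from dataclasses import dataclass

# Assumed clock uncertainty bound in seconds (hypothetical value, not Google's).
EPSILON = 0.007

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now() -> TTInterval:
    """Return 'now' as an uncertainty interval [earliest, latest]."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit_wait(commit_ts: float) -> None:
    """Block until commit_ts is definitely in the past everywhere:
    i.e. until even the earliest possible current time exceeds it."""
    while tt_now().earliest <= commit_ts:
        time.sleep(0.001)

# A transaction picks its commit timestamp at the pessimistic end of the
# interval, then commit-waits before making its writes visible.
commit_ts = tt_now().latest
commit_wait(commit_ts)
```

With a small `EPSILON` the wait is roughly two uncertainty bounds, which is why many such waits can overlap other work without hurting throughput.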

Transaction Process #

Spanner provides three transaction modes.

  1. Read-write transaction: implemented with distributed locks, and therefore the mode with the lowest concurrency. Data is written through the leader node of each Paxos Group.
  2. Read-only transaction: lock-free and can be served by any replica. However, to read the latest data, the read must go to the leader node, which can obtain the latest committed timestamp from its Paxos Group.
  3. Snapshot read: as the name suggests, Spanner implements MVCC and snapshot isolation, so all reads within a transaction see a consistent snapshot. This also implies that Spanner stores multiple versions of the same data.
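The multi-version idea behind snapshot reads can be shown with a minimal sketch (my own illustration, not Spanner’s storage format): each key keeps timestamped versions, and a read at timestamp `t` sees the newest version no later than `t`, so every read in the same snapshot is consistent.

```python
import bisect

class MVCCStore:
    """Toy multi-version store: read(key, ts) returns the newest value
    written at or before ts, ignoring later writes."""
    def __init__(self):
        self.ts = {}    # key -> sorted list of write timestamps
        self.vals = {}  # key -> values parallel to self.ts[key]

    def write(self, key, ts, value):
        i = bisect.bisect_right(self.ts.setdefault(key, []), ts)
        self.ts[key].insert(i, ts)
        self.vals.setdefault(key, []).insert(i, value)

    def read(self, key, ts):
        i = bisect.bisect_right(self.ts.get(key, []), ts)
        return self.vals[key][i - 1] if i else None

store = MVCCStore()
store.write("k", 10, "old")
store.write("k", 20, "new")
assert store.read("k", 15) == "old"  # a snapshot at 15 never sees the later write
assert store.read("k", 25) == "new"
```

A transaction that reads everything at one snapshot timestamp gets a consistent view without taking any locks, at the cost of retaining old versions.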

Now that we understand the transaction model, let’s delve into its internals and see what the core components of Spanner are. Here is an architecture diagram of Spanner.

Drawing 0.png

From the diagram above, we can see that each replica stores multiple tablets, and these replicas form a Paxos Group. The Paxos Group elects a leader to coordinate with the leaders of other Paxos Groups in multi-shard transactions (I will cover the details of the Paxos algorithm in the next lecture).

Write operations must go through the leader, while reads can be served by any replica that is up to date. The diagram also shows that the leader contains a lock manager, which implements the locking mentioned under concurrency control, and a transaction manager, which handles multi-shard distributed transactions. Synchronous write operations must acquire locks, while snapshot reads are lock-free.

The most complex operation is a write involving multiple shards, which follows a two-phase commit process among the leaders. In the prepare phase, the prepared data is replicated through the coordinator’s Paxos Group, which solves the following two problems:

  1. The data of the entire transaction is secure. The crash of the coordinator will not affect the continuation of the transaction, and we can recover the transaction data from the Paxos Group.
  2. The crash of a participant will not affect the transaction. Because the Paxos Group can choose new nodes to continue executing unfinished transaction operations.
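The two points above can be sketched as a toy two-phase commit where the coordinator and every participant are replicated groups. This is my own simplified model, not Spanner’s actual protocol code; the `Group` class merely mimics the effect of consensus (every replica durably stores each record), so any surviving replica can recover the decision after a crash.

```python
class Group:
    """Stand-in for a Paxos Group: 'replicate' just appends the record
    to every replica's log, mimicking the durability consensus provides."""
    def __init__(self, name, replicas=3):
        self.name = name
        self.logs = [[] for _ in range(replicas)]

    def replicate(self, record):
        for log in self.logs:
            log.append(record)

    def last_record(self):
        """Any single surviving replica can recover the latest decision."""
        return self.logs[0][-1] if self.logs[0] else None

def two_phase_commit(coordinator, participants):
    # Phase 1: each participant replicates its prepared data via its group,
    # and the commit decision is replicated in the coordinator's group,
    # so a crashed coordinator no longer blocks the transaction.
    for p in participants:
        p.replicate(("PREPARE", coordinator.name))
    coordinator.replicate(("COMMIT",))
    # Phase 2: participants replicate and apply the outcome; a crashed
    # participant's group can elect a new node to finish from its log.
    for p in participants:
        p.replicate(("COMMIT",))
    return "committed"

shards = [Group("shard-1"), Group("shard-2")]
coord = Group("coordinator")
result = two_phase_commit(coord, shards)
```

Because every record lands in a replicated log before the protocol advances, losing any single node leaves enough state to resume the transaction.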

In terms of isolation, Spanner implements SSI, serializable snapshot isolation, by means of the lock table mentioned earlier. The locks are fully exclusive: they block not only concurrent writes but also reads, thereby preventing the write-skew anomaly of plain snapshot isolation.
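A minimal exclusive lock table conveys the idea (my own sketch, not Spanner’s actual lock manager): every access, read or write, must hold the exclusive lock on a key, which rules out the read-then-write interleavings that produce write skew under plain snapshot isolation.

```python
import threading

class LockTable:
    """Per-key exclusive locks; the dictionary itself is guarded."""
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def acquire(self, key):
        self._lock_for(key).acquire()

    def try_acquire(self, key):
        # Non-blocking probe: returns False if someone else holds the lock.
        return self._lock_for(key).acquire(blocking=False)

    def release(self, key):
        self._lock_for(key).release()

table = LockTable()
table.acquire("row:42")                    # writer holds the exclusive lock
blocked = not table.try_acquire("row:42")  # a concurrent reader would block here
table.release("row:42")
```

Since even readers must queue behind the writer, two transactions cannot both read the old snapshot and then write disjoint rows based on it, which is exactly the write-skew pattern.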

Throughout the process, the transaction’s start time and commit time (the time at which data becomes visible) are obtained through TrueTime. After obtaining such an interval, Spanner must wait until the interval has fully passed before proceeding; otherwise the system might read inconsistent data, for example failing to see data written before the current time, or seeing the partial results of an in-flight transaction.

At the same time, Spanner describes its transaction guarantee as external consistency. It works as follows: first, concurrent transactions are serializable, since, as shown above, Spanner implements SSI; second, it is linearizable, meaning that if transaction A commits before transaction B in “real” time, then A’s timestamp must be less than B’s. Readers with a deeper understanding of consistency will notice the relationship between transactions and consistency here: any distributed database needs to describe both its transaction guarantees (for concurrent operations) and its consistency guarantees (for non-concurrent operations), and Spanner’s so-called external consistency is serializability plus linearizability.

Spanner is not only a commercial product available on Google Cloud; many open-source databases, such as CockroachDB and YugaByteDB, are designed based on its ideas. Spanner is therefore considered a mature solution spanning from open source to commercial, and from on-premise deployment to the cloud.

Above, I have explained the characteristics of Spanner. Now let’s take a look at some features of its competitor, Calvin.

Calvin and FaunaDB #

Spanner introduced many new technologies to improve the performance of distributed transactions, but its overall process, as we have seen, is still based on traditional two-phase commit, without major structural changes. Calvin, on the other hand, is full of disruptive innovations. Let’s see how it handles distributed transactions.

Firstly, traditional distributed transaction processing uses locks to ensure that competing concurrent transactions satisfy the constraints of the isolation level; for example, the serializable level guarantees that transactions take effect as if executed one after another. Under that scheme, the execution order on each replica cannot be predicted, only the result can. Calvin’s approach is to make the execution order of transactions identical on every replica, so the results are necessarily identical too. The benefit is that lock contention among numerous transactions is avoided, greatly improving throughput for highly concurrent transactions. Node failures also do not affect transaction execution: since the steps of each transaction are assigned in advance, a node can resume an unfinished transaction from the failure point after recovery. This mode greatly improves the availability of distributed transactions. Currently, FaunaDB is the database that implements the Calvin transaction model.

Secondly, the component that orders transactions is called the sequencer. It collects transaction inputs and batches them into short epochs; this reduces lock contention and increases parallelism. Once the transactions are ordered, the sequencer hands them to the scheduler. The scheduler executes transaction steps in parallel where safe, based on the sequencer’s output, while ensuring that steps that must run sequentially are not parallelized. Since the order is fixed in advance, the scheduler does not need to interact with the sequencer during execution, which improves efficiency. Calvin’s transaction-processing components are shown in the following diagram.
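The sequencer/scheduler split can be sketched in a few lines. This is a deliberately simplified model in the spirit of Calvin, not FaunaDB’s implementation; the epoch size and function names are my own. The key property it demonstrates is determinism: two replicas replaying the same ordered input reach the same state.

```python
from collections import defaultdict

EPOCH_SIZE = 2  # hypothetical batch size

def sequencer(transactions):
    """Batch incoming transactions into fixed-size epochs,
    preserving the globally agreed arrival order."""
    return [transactions[i:i + EPOCH_SIZE]
            for i in range(0, len(transactions), EPOCH_SIZE)]

def scheduler(epochs, state):
    """Apply epochs strictly in sequence. Within an epoch, transactions
    on disjoint keys could run in parallel; we run them serially here."""
    for epoch in epochs:
        for key, delta in epoch:  # each 'transaction' is a (key, delta) pair
            state[key] += delta
    return state

txns = [("x", 1), ("y", 2), ("x", 3), ("y", -1)]
replica_a = scheduler(sequencer(txns), defaultdict(int))
replica_b = scheduler(sequencer(txns), defaultdict(int))
assert replica_a == replica_b  # same input order => same final state
```

No locks are needed across replicas: agreement on the input order replaces agreement on each transaction’s outcome.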

Drawing 1.png

Calvin also uses the Paxos algorithm, but unlike Spanner, where each shard has its own Paxos Group, Calvin uses Paxos (or asynchronous replication) to agree on which transactions enter which epoch.

Calvin transactions also involve a read set and a write set: the former is the data the transaction needs to read, the latter the data it will modify. Both sets must be determined before the transaction starts. Consequently, Calvin does not support reading dynamic data within a transaction and letting it affect the final result set. This point is very important and is the core of the battle. Now that you understand both transaction models, let me take you onto the “battlefield”. Between these two equally strong contenders, the Calvin faction fired the first shot.
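The restriction can be made concrete with a small sketch (my own modeling, not Calvin’s API): a transaction declares its read set and write set up front, and any attempt to write a key outside the declared set must abort, to be re-submitted with corrected sets.

```python
class UndeclaredAccess(Exception):
    """Raised when a transaction touches data outside its declared sets."""
    pass

def run_declared(txn, read_set, write_set, db):
    # Reads are resolved from the declared read set before execution.
    reads = {k: db[k] for k in read_set}

    def write(key, value):
        if key not in write_set:
            # Dynamic, undeclared access is forbidden: abort and re-plan.
            raise UndeclaredAccess(key)
        db[key] = value

    txn(reads, write)

db = {"a": 1, "b": 2, "c": 0}

def transfer(reads, write):
    write("c", reads["a"] + reads["b"])  # ok: "c" was declared up front

run_declared(transfer, read_set={"a", "b"}, write_set={"c"}, db=db)
```

A transaction whose writes depend on values it discovers mid-flight (for example, a lookup through a frequently changing secondary index) would keep tripping this check and retrying, which is exactly the weakness debated below.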

Criticism of Spanner #

Professor Daniel Abadi of the University of Maryland, co-author of the Calvin paper and a consultant for FaunaDB, is well qualified to represent the Calvin faction in challenging Spanner.

At the beginning, Professor Abadi mainly discussed the performance differences between Spanner and Calvin’s architectures, comparing them in the following aspects:

  1. Traditional read-write transactions: for a transaction within a single shard (the non-distributed case), the performance of the two is similar; for cross-shard transactions, however, he argues that Calvin’s performance is far better than Spanner’s. The reason is that Spanner suffers two performance losses: first, TrueTime returns a time range, so a commit must wait out that range (though this wait can overlap with other work); second, Spanner uses two-phase commit, which in theory yields higher latency than Calvin’s “one-phase” commit.
  2. Snapshot reads: here the two are similar in principle, so latency is low for both.
  3. Read-only transactions: here Spanner is more efficient, since it reads only from the leader node, while Calvin performs a globally consistent read, resulting in higher latency.

In addition to the above comparisons, Calvin has an advantage in log replication: Spanner’s replication must follow the full Paxos process, while Calvin, thanks to its preprocessing, can replicate its input log simply and efficiently. In theory, this advantage grows as the physical distance between nodes increases.

Of course, we know that Calvin’s preprocessing mechanism limits what operations a transaction may perform, and Professor Abadi acknowledged this limitation as well.

That was Professor Abadi’s performance comparison of the two, which is relatively objective and neutral, without sharp conflict. However, he then pointed out a very controversial issue with Spanner, relating to TrueTime. Since TrueTime has never been proven free of clock skew at the theoretical level, and has only been shown through extensive engineering practice to make skew very unlikely, this residual probability became a point of attack.

Professor Abadi was clever, or one could say wise, here. He did not attack TrueTime itself; instead, he pointed out that TrueTime’s reliance on hardware such as atomic clocks makes the technology hard for others to replicate. This leads to a topic long discussed in the tech community: Google’s technology stops working once it leaves Google.

What Abadi really wanted to challenge were the open-source and commercial databases based on Spanner’s ideas, such as the CockroachDB and YugaByteDB mentioned earlier. Their “TrueTime” is implemented in software, which, compared with hardware, increases the probability of the clock skew described above. CockroachDB is spared because it documents the possibility of such anomalies; YugaByte does not, so it became the main target of Professor Abadi’s attack.

Finally, the professor mentioned that Calvin and FaunaDB have theoretically proven their ability to achieve consistency.

Since the Calvin faction started the war and aimed it squarely at YugaByteDB, the latter launched a counterattack.

Counterattack of Spanner followers #

Since the trouble came to YugaByte, it is only natural for them to strike back.

The professor’s viewpoints above can be summarized as follows:

  1. In terms of performance, Calvin has higher throughput than Spanner due to its shorter lock-holding time.
  2. In terms of consistency, even hardware-based TrueTime has some probability of time reversal, and software-based “TrueTime” cannot guarantee monotonically increasing time at all.

Regarding the first issue, YugaByte begins by acknowledging Calvin’s throughput advantage. The situation changes, however, when YugaByte brings in its well-known research on distributed transaction patterns, which analyzes the transaction patterns of many AWS Dynamo users. The conclusion: about 90% of transactions touch a single row in a single shard, and only about 10% span multiple shards. On this basis, YugaByte calls the former the primary workload and the latter the secondary workload.

For the primary workload, the professor himself admits there is no significant performance difference between Spanner and Calvin; Calvin’s advantage therefore exists only in the secondary workload. As the saying goes, “discussing toxicity without mentioning the dose is dishonest”: since Calvin’s advantage is confined to the secondary workload, its importance is greatly diminished.

The second issue is the real core. To YugaByte’s credit, they do not dodge it; they openly acknowledge that software implementations of TrueTime such as theirs cannot achieve strict serializability the way Calvin can, and instead achieve “best-effort” serializability, which breaks once the TrueTime range exceeds a threshold. However, YugaByte offers two points for users to consider:

  1. Neither Spanner-like solutions nor Calvin has consistency issues in the primary workload described earlier; the issues arise only in the secondary workload.
  2. As atomic clock services gradually become available on AWS, Alibaba Cloud, and other public clouds, databases like YugaByte will also be able to use true TrueTime, greatly reducing the probability of time reversal.

The explanations above make clear that software NTP-based clocks do have problems, but they remain usable when the user’s scenario does not demand strict guarantees.

In addition to addressing the issues raised by the professor, YugaByte also points out some “fatal” flaws in Calvin-like databases.

  1. As the professor himself admitted, Calvin’s read performance is weaker than Spanner’s.
  2. The static write and read sets cause problems with secondary indexes and session-level transactions. As mentioned earlier, a Calvin transaction’s writes cannot depend on reads performed within the same transaction; if the columns of a secondary index are modified frequently, Calvin transactions will be retried repeatedly, reducing throughput.
  3. Another drawback of Calvin is the lack of an open-source implementation. Currently only FaunaDB, a closed-source commercial product, is available, leaving users accustomed to open-source technology stacks without options.
  4. FaunaDB does not use SQL but a new language called FQL, with a GraphQL-like style. Switching to FaunaDB is a significant challenge for teams used to SQL.

It can be seen that the YugaByte team has provided their response to the criticisms, but is there a definite outcome to their debate?

Who is the winner? #

From the current state of development, neither side can completely replace the other. Calvin has a clear advantage in highly contended transaction scenarios, while Spanner’s advantages in reads and session-level transactions are irreplaceable. Judging from their principles, neither can win outright. But we are not really looking for a final winner; rather, we hope future transaction models will draw on both to bring us more efficient solutions for distributed transactions.

With this, we have spent two lectures on a detailed introduction to database-oriented distributed transactions. The next lecture covers the final topic of Module Three: consensus algorithms, the core algorithms of modern distributed systems. I hope to see you there.