21 | Knowledge Series: How to Balance Performance and Scalability #

In this lesson, we will wrap up Module 3. After studying this module, I believe you have gained a solid understanding of the distributed-systems techniques commonly used in distributed databases. This lesson is designed to help you tie that knowledge together, and I will use several examples to show how it is applied in real distributed databases.

Knowledge Summary #

At the beginning of this module, I emphasized the most important issue that distributed systems need to address: failure. Failure is omnipresent in distributed systems: network fluctuations can break connectivity between nodes, and the nodes themselves can become unstable, both of which affect the stability of the system as a whole. In “13 | Overview: What Problems Do Distributed Systems Need to Solve?” we defined the failure models, which include crash failures, omission failures, and arbitrary failures.

Among them, distributed databases usually focus on the first two types, because arbitrary failures assume that nodes may forge data, which is not a common scenario for distributed databases built on trusted networks. Moving down the failure model, from crash failures to arbitrary failures, the causes become harder to predict and harder to handle. Most of the techniques and algorithms introduced in this module deal with the first type of failure, while consensus algorithms mainly address the second.

Next, we introduced failure detection methods. Heartbeat-based algorithms are the most widely used in distributed databases, while peer-to-peer distributed databases without a master node, such as Apache Cassandra, rely on Gossip-based failure detection.
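To make the heartbeat idea concrete, here is a minimal sketch of a timeout-based heartbeat failure detector in Python. The class name, node ids, and the fixed timeout are illustrative assumptions rather than the design of any particular database.

```python
import time

class HeartbeatDetector:
    """A minimal timeout-based heartbeat failure detector (illustrative sketch)."""

    def __init__(self, timeout_seconds=3.0):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node id -> monotonic timestamp of the last heartbeat

    def record_heartbeat(self, node_id):
        # Called whenever a heartbeat message arrives from a peer.
        self.last_seen[node_id] = time.monotonic()

    def suspected_nodes(self):
        # Any node silent for longer than the timeout is suspected of having failed.
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

# Usage sketch: peers report in periodically; silence beyond the timeout raises suspicion.
detector = HeartbeatDetector(timeout_seconds=3.0)
detector.record_heartbeat("node-a")
detector.record_heartbeat("node-b")
print(detector.suspected_nodes())  # [] while every node keeps reporting in time
```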

After discussing failure detection, I turned to the master-slave pattern in distributed systems. Compared with fully peer-to-peer systems, distributed databases with a master node have advantages such as higher performance and more predictable state. Here we focused on leader election: I introduced the Bully algorithm and its variants, which cover the main techniques for electing a leader, and whose influence can be seen in the leader election schemes of consensus algorithms such as ZAB and Raft.

After leader election, we extended the discussion of replication and consistency from Module 1 with additional explanations of client consistency and eventual consistency, which completes the consistency content of this course. In “16 | Consistency Beyond CAP: What Other Consistency Models Are There?” I presented a complete consistency model tree to help you build a comprehensive picture of consistency. Client consistency and eventual consistency are often used together and are well suited to AP-style databases such as AWS DynamoDB, a masterless distributed database that can be deployed globally across regions and is generally classified as NoSQL. Other typical examples include Apache Cassandra and Azure Cosmos DB.
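Returning to leader election for a moment, below is a minimal sketch of the Bully algorithm's core rule: the highest-id node that responds to an election wins. The single-round simplification and the node ids are assumptions for illustration; real implementations also exchange election, answer, and coordinator messages.

```python
def bully_election(initiator, responding_nodes):
    """One simplified round of the Bully algorithm from the initiator's point of view.

    Node ids are comparable, and `responding_nodes` stands in for the set of live
    nodes that answered the election message in time.
    """
    higher = [n for n in responding_nodes if n > initiator]
    if not higher:
        # No higher-id node answered, so the initiator declares itself leader.
        return initiator
    # Otherwise the highest-id responder takes over the election and wins it.
    return max(higher)

# Usage sketch: node 2 notices the old leader (node 5) is unreachable and starts an election.
print(bully_election(initiator=2, responding_nodes=[1, 2, 3, 4]))  # -> 4
```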

Because eventual consistency allows data on different nodes to diverge temporarily, and, following the idea of increasing entropy, this divergence grows over time, an eventually consistent database would in the end become unusable if nothing intervened. For this reason, we introduced anti-entropy. Anti-entropy mechanisms fall into foreground and background categories: the former includes read repair and hinted handoff queues, while the latter includes Merkle trees and bitmap version vectors. The Gossip protocol mentioned in “14 | Fault Detection: How to Ensure Stability in Distributed Systems” not only detects abnormal conditions but, more importantly, reliably spreads messages across the entire distributed system, which also serves the goal of anti-entropy. Gossip is well suited to large-scale distributed databases and is used by systems such as Apache Cassandra and Redis.
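As a small illustration of the background approach, the sketch below compares per-key-range Merkle hashes between two replicas to find the ranges that need repair, so that only diverging data has to be streamed. The hashing scheme, range layout, and function names are illustrative and are not Cassandra's actual implementation.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_hashes):
    """Fold a list of per-range hashes into a single root hash."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def diverging_ranges(local_leaves, remote_leaves):
    """Compare per-range hashes and return the indices of ranges that need repair."""
    return [i for i, (a, b) in enumerate(zip(local_leaves, remote_leaves)) if a != b]

# Usage sketch: each replica hashes its key ranges and only the hashes are exchanged.
local  = [_h(b"range0-v1"), _h(b"range1-v1"), _h(b"range2-v1"), _h(b"range3-v1")]
remote = [_h(b"range0-v1"), _h(b"range1-v2"), _h(b"range2-v1"), _h(b"range3-v1")]
if merkle_root(local) != merkle_root(remote):
    print(diverging_ranges(local, remote))  # -> [1]: only range 1 must be repaired
```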

Distributed transactions are the focal point of this module, and I devoted two lectures to them. First I discussed the most typical atomic commit protocols, namely two-phase and three-phase commit. Although three-phase commit addresses some problems of two-phase commit, it is rarely used in practice because of its low performance and the possibility of split-brain. Next we introduced snapshot isolation and serializable snapshot isolation, the most important isolation levels in modern distributed databases, which enable lock-free, high-performance distributed transactions. The Percolator transaction model is an efficient, optimistic, lock-free transaction solution built on these isolation levels, and TiDB currently uses it for its optimistic transactions.
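To recap the atomic-commit flow, here is a minimal two-phase commit coordinator sketch. The Participant class and its prepare/commit/abort methods are hypothetical stand-ins for real resource managers, and the sketch deliberately omits timeouts and crash recovery, which is exactly where the protocol's weaknesses lie.

```python
class Participant:
    """A toy resource manager used only to illustrate the protocol."""

    def __init__(self, name, vote_yes=True):
        self.name, self.vote_yes = name, vote_yes

    def prepare(self):
        # Durably record redo/undo information, then vote yes or no.
        return self.vote_yes

    def commit(self):
        print(f"{self.name}: commit")

    def abort(self):
        print(f"{self.name}: abort")

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare and collect the votes.
    if all(p.prepare() for p in participants):
        # Phase 2a: everyone voted yes, so tell everyone to commit.
        for p in participants:
            p.commit()
        return "committed"
    # Phase 2b: at least one participant voted no, so abort everywhere.
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("shard-1"), Participant("shard-2")]))  # -> committed
```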

Distributed transactions have always been a hotbed of theoretical innovation in distributed databases. In this module I compared two innovative transaction models: Spanner and Calvin. The former combines Paxos groups with TrueTime and strikes a remarkable balance; the latter impressed us with its throughput on high-contention transactions. The debate between the two camps has clarified their respective strengths and weaknesses, which is valuable for understanding and advancing the theory of distributed transactions, and we look forward to seeing whether future work can produce even better solutions.

In the last lecture of this module, “20 | Consensus Algorithms: Understanding the Differences Between Paxos, Raft, and Other Algorithms,” we introduced the crowning achievement of distributed systems theory: the distributed consensus algorithm. A consensus algorithm can be seen as combining failure models, failure detection, leader election, and consistency, delivering all of these capabilities through a single algorithmic framework. It is a powerful building block for distributed databases, and many solutions for distributed transactions and data replication rely on it. The consensus algorithm most commonly used in distributed databases today is Raft, which can be viewed as a simplified Multi-Paxos; its ease of implementation and fast recovery make it particularly attractive to database maintainers.
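As a small taste of how such an algorithm combines these pieces, the sketch below shows a simplified Raft-style vote-granting rule: a node grants its vote only to a candidate whose term is at least as large as its own and whose log is at least as up to date. The dict-based state and field names are illustrative simplifications; a real implementation must also persist the term and vote before replying.

```python
def handle_request_vote(state, candidate_term, candidate_id,
                        candidate_last_log_index, candidate_last_log_term):
    """Simplified Raft vote handler; `state` holds current_term, voted_for, and log."""
    if candidate_term < state["current_term"]:
        return False                                # stale candidate, refuse
    if candidate_term > state["current_term"]:
        state["current_term"] = candidate_term      # step down to the newer term
        state["voted_for"] = None
    last_index = len(state["log"])
    last_term = state["log"][-1][0] if state["log"] else 0
    # The candidate's log must be at least as up to date as ours (term first, then length).
    log_ok = (candidate_last_log_term, candidate_last_log_index) >= (last_term, last_index)
    if log_ok and state["voted_for"] in (None, candidate_id):
        state["voted_for"] = candidate_id
        return True
    return False

# Usage sketch: a follower in term 3 receives a vote request for term 4.
state = {"current_term": 3, "voted_for": None, "log": [(1, "x=1"), (3, "x=2")]}
print(handle_request_vote(state, 4, "node-b", 2, 3))  # -> True
```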

With that, we have completed the content of this module. Next, I will put this knowledge into practice through several concrete case studies. I have chosen three representative databases: TiDB, Alibaba’s PolarDB-X, and Apache Cassandra. They are typical modern distributed databases, and we will see how they apply the topics covered in this module.

TiDB: Building Pessimistic Transactions with Optimistic Transactions #

In the lecture on distributed transactions, I mentioned that TiDB’s optimistic transactions use Google’s Percolator model, and that TiDB has made its own improvements to it. Among distributed databases at home and abroad, TiDB is the database most closely associated with Percolator-based transactions.

TiDB’s overall architecture is based on Google Spanner and F1, and consists of two layers: TiDB and TiKV. TiDB corresponds to Google F1 and is a stateless SQL layer that is compatible with most MySQL syntax. It exposes the MySQL network protocol and is responsible for parsing user SQL statements, generating distributed query plans, and translating them into underlying key-value operations to be sent to TiKV. TiKV is the actual storage for data and corresponds to Google Spanner. It is a distributed key-value database that supports elastic horizontal scaling, automatic disaster recovery and failover, as well as ACID cross-row transactions. The following diagram shows the architecture of TiDB.

Drawing 0.png

In terms of transactions, TiDB’s implementation of pessimistic transactions is surprisingly simple. After studying the Percolator model carefully, the TiDB team found that by slightly changing the behavior of two-phase commit at the point where the client calls Commit, and moving the locking and waiting steps into the execution of DML statements inside the transaction, pessimistic transactions could be supported easily and efficiently.

The principle behind TiDB’s implementation of pessimistic locks is that during the process of executing DML (UPDATE/DELETE) in a transaction, TiDB not only caches the rows that need to be modified locally, but also directly locks these rows pessimistically. The format of pessimistic locks is almost identical to that of optimistic transactions, but the content of the locks is empty, serving as placeholders. When committing, these pessimistic locks are directly modified into standard Percolator model locks, and the subsequent process remains the same as before.

This approach is largely compatible with the original transaction implementation, preserving scalability, high availability, and flexibility. At the same time, it reuses as much of Percolator’s original optimistic-transaction machinery as possible, keeping the overall complexity of the transaction model low.
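The sketch below illustrates this placeholder-then-upgrade idea with a toy lock table. The data layout, timestamps, and function names are mine for illustration and do not reflect TiKV's actual structures or APIs.

```python
# A toy lock table standing in for the lock records kept alongside each row.
locks = {}  # key -> {"txn": start_ts, "kind": "pessimistic" | "prewrite", "primary": key}

def acquire_pessimistic_lock(key, start_ts, primary):
    """During DML execution, place an empty placeholder lock on the row (or fail if held)."""
    holder = locks.get(key)
    if holder and holder["txn"] != start_ts:
        raise RuntimeError(f"{key} is locked by txn {holder['txn']}; wait or retry")
    locks[key] = {"txn": start_ts, "kind": "pessimistic", "primary": primary}

def prewrite(key, start_ts, value):
    """At commit time, upgrade the placeholder into an ordinary Percolator-style lock."""
    lock = locks.get(key)
    assert lock and lock["txn"] == start_ts and lock["kind"] == "pessimistic"
    lock["kind"] = "prewrite"
    lock["value"] = value  # from here on, the usual two-phase commit proceeds unchanged

# Usage sketch: lock rows while executing UPDATE/DELETE, then commit as in the optimistic path.
acquire_pessimistic_lock("row-1", start_ts=100, primary="row-1")
acquire_pessimistic_lock("row-2", start_ts=100, primary="row-1")
prewrite("row-1", start_ts=100, value="v1")
prewrite("row-2", start_ts=100, value="v2")
```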

So far, I have explained how TiDB uses the Percolator model and its variants to implement optimistic and pessimistic transactions. Next, I will introduce how Alibaba’s PolarDB-X utilizes consensus algorithms to build a geographically distributed multi-active distributed database.

PolarDB-X: Building a Geographically Distributed Multi-Active System Using Paxos #

With the rapid growth of its business, “geographically distributed multi-active” has become the new standard for Alibaba’s applications. Driven by this need, PolarDB-X initially implemented a consensus-based consistency module, called X-Paxos, on top of single-node MySQL, and combined it with the TDDL sharding strategy to partially meet these business demands. As the technology continued to evolve and the cloud era arrived, PolarDB-X 2.0 combined a distributed SQL engine with X-Paxos-based storage to deliver a brand-new cloud-native distributed database.

The X-Paxos algorithm is based on Multi-Paxos with a leader node, which, as I mentioned in the consensus lecture, is the Paxos variant that extensive engineering practice has shown to be the most efficient.

Building on this base algorithm and driven by Alibaba’s business scenarios and their requirements for high performance and ecosystem integration, X-Paxos adds many innovative features and performance optimizations. Compared with basic Multi-Paxos, it offers richer functionality and significantly better performance. Here I will introduce several of its optimizations.

Leader Election #

X-Paxos is based on standard Multi-Paxos and supports adding and removing nodes of multiple roles online, as well as quickly transferring leadership from one node to another online. These online operations greatly simplify planned maintenance of distributed nodes and shorten business recovery time.

Awareness of Availability Zone Locations #

Alibaba’s current multi-region architecture still requires a central data center. For example, because of how an application is deployed, it may need to write to the database only in the central region, without city-level disaster recovery for writes. At the same time, if a city-level disaster strikes (all data centers in one city become unavailable), the leader can be switched to a non-central city without losing any data.

Node Function Trimming #

In the Paxos algorithm, every node carries the Proposer, Acceptor, and Learner functions; that is, every node is a full-function node. However, in some cases not every node in the cluster needs all of these functions. X-Paxos trims node functions in the following ways:

  1. Trim the state machine of some nodes and keep only the log (such nodes hold no data but still participate in log synchronization and count toward the quorum). In this case the Proposer function is removed, while the Acceptor and Learner functions are kept.
  2. Some nodes only subscribe to or consume the log stream produced by the protocol and are not members of the cluster. In this case the Proposer and Acceptor functions can be removed, keeping only the Learner function.

Combining these trimming methods improves cluster utilization, saves costs, and allows more flexible combinations of functionality.
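The sketch below shows how such role trimming can be thought of as per-node configuration: a full node, a log-only node that still counts toward the quorum, and a pure subscriber. The role names and fields are my own illustration, not X-Paxos's actual API.

```python
from dataclasses import dataclass

@dataclass
class NodeRole:
    """Which Paxos functions a node keeps (illustrative field names)."""
    proposer: bool           # may be elected leader and propose values
    acceptor: bool           # votes on proposals and counts toward the quorum
    learner: bool            # receives the chosen log entries
    has_state_machine: bool  # applies the log to actual data

FULL_NODE  = NodeRole(proposer=True,  acceptor=True,  learner=True,  has_state_machine=True)
LOG_ONLY   = NodeRole(proposer=False, acceptor=True,  learner=True,  has_state_machine=False)
SUBSCRIBER = NodeRole(proposer=False, acceptor=False, learner=True,  has_state_machine=False)

def quorum_members(cluster):
    """Only acceptor nodes participate in the majority calculation."""
    return [name for name, role in cluster.items() if role.acceptor]

cluster = {"az1-db": FULL_NODE, "az2-db": FULL_NODE, "az3-log": LOG_ONLY, "bi-feed": SUBSCRIBER}
print(quorum_members(cluster))  # ['az1-db', 'az2-db', 'az3-log'] -> a quorum needs 2 of these 3
```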

These are some of the ways PolarDB-X applies consensus algorithms. Next, let’s take a look at how Apache Cassandra achieves adjustable consistency.

Apache Cassandra: Adjustable Consistency #

Apache Cassandra provides adjustable consistency, allowing developers to trade off data consistency against availability. This flexibility is controlled by the client: consistency can be set globally or adjusted per read and write operation. For example, updates to important data may require high consistency, while less critical applications or services can relax consistency in exchange for better performance.

Cassandra’s adjustable consistency, as I mentioned in the consistency lecture of this module, can be divided into write consistency and read consistency.

Write Consistency #

Write consistency declares how many nodes must acknowledge a write before it is considered successful. Cassandra’s write consistency can be tuned from strong to weak; I have summarized the levels in the following table.

Drawing 1.png

We can see that the ANY level effectively corresponds to eventual consistency. Cassandra uses the hinted handoff technique mentioned in the anti-entropy lecture to keep written data reliable: when a target node is down, the data is temporarily stored in a hinted handoff queue and replayed once the node recovers.
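As an example of setting the write consistency level per statement, the snippet below uses the DataStax Python driver. The contact point, keyspace, and tables are hypothetical; only the consistency-level mechanism itself is the point.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # assumed locally reachable test cluster
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# QUORUM: a majority of replicas must acknowledge the write before it succeeds.
insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, "alice"))

# ANY: even a hint stored on a coordinator counts, i.e. eventual consistency.
relaxed = SimpleStatement(
    "INSERT INTO page_views (id, views) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ANY,
)
session.execute(relaxed, (42, 1))
```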

Read Consistency #

For read operations, the consistency level specifies how many replica nodes must respond to a read query before returning data. Here is another table I’ve compiled for you.

Drawing 2.png

Cassandra uses read repair to fix stale data on replicas during reads. This repair runs as a background thread and does not block the read itself.
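Read consistency is chosen the same way. Continuing with the session from the previous snippet, the sketch below contrasts a fast single-replica read with a QUORUM read; when writes also use QUORUM, the read and write replica sets overlap, so the QUORUM read observes the latest acknowledged write.

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# ONE: fastest, but may return stale data from a lagging replica.
fast_read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# QUORUM: a majority of replicas must respond; paired with QUORUM writes it stays fresh.
safe_read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

row = session.execute(safe_read, (42,)).one()
print(row.name if row else "not found")
```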

These are some of the details of how Apache Cassandra implements adjustable consistency. AWS DynamoDB and Azure Cosmos DB offer similar options, and you can compare Cassandra’s model with their documentation for further study.

Summary #

Distributed systems are the backbone of distributed databases and work together with storage engines to deliver full database functionality. The storage engine mainly governs writes, that is, the performance of the database, determining how fast data can be processed; the distributed system handles communication between nodes, that is, the scalability of the database, determining how far it can scale and how much data it can hold. The two must cooperate, and when designing around a database you can trade one off against the other to make efficient use of resources.

In this module on distributed systems, we introduced the knowledge most commonly used in distributed databases. Transactions, consistency, and consensus are the core of this module, and I hope you master them well.

In the next module, we will move on to real-world case studies, where I will categorize and introduce typical distributed databases on the market. By analyzing them with the knowledge you have learned in this course, I hope to help you use them more effectively.

Thank you for studying, and see you in the next lecture.