22 Distributed Database Architecture Thoroughly Understand What a Distributed Database Is

22 Distributed Database Architecture - Thoroughly Understand What a Distributed Database Is #

In the previous three modules, we learned the core concepts of MySQL architecture design, including table structure design, index design, and high availability design. I believe that with the knowledge gained from the previous content, you are now capable of performing MySQL database architecture design effectively.

Starting from this module, we will delve into the complex realm of architecture design and learn about the design of distributed databases.

We all know that internet applications have become ubiquitous, and the amount of data is constantly increasing. For internet businesses such as Taobao, Meituan, and Baidu, traditional single-instance databases are unable to meet the performance and storage requirements. Therefore, distributed architecture has experienced significant development.

For developers and DBAs, it is crucial to recognize that database technology is going through a major transformation. It is beneficial to gain an early understanding of distributed architecture design and assist companies in transitioning from outdated single-instance architectures to distributed architectures. This knowledge will significantly improve your competitiveness in the workplace.

Without further ado, let’s dive into the topic of distributed architecture design. In this lecture, let’s start by understanding what a “distributed database” is.

Concept of Distributed Database #

According to the official definition on Wiki, a distributed database is:

A distributed database is a database in which data is stored across different physical locations. It may be stored in multiple computers located in the same physical location (e.g. a data centre); or maybe dispersed over a network of interconnected computers.

From this definition, we can understand that a distributed database is a database that stores data across different physical locations.

In contrast to the databases we have previously studied, where data is stored on a physical storage corresponding to a single instance, in a distributed database, data is stored on different database instances.

Diagram 0.png

Architecture of a Distributed Database

From the diagram, we can see that in a distributed database, there are three layers: compute layer, metadata layer, and storage layer:

The compute layer is equivalent to the SQL layer in a single-instance database. It is responsible for performing access authorization, routing access, and other operations related to data manipulation.
The metadata layer records the information about how many storage nodes are in the distributed database cluster and their corresponding IP addresses, ports, etc. When the compute layer of a distributed database starts, it first accesses the metadata layer to obtain all the cluster information to ensure the correct parsing and routing of SQL queries. Additionally, because the metadata information is stored in the metadata layer, there can be multiple compute layers in a distributed database to achieve performance scalability.
The storage layer is responsible for storing the data. However, the storage layer needs to be on the same server as the compute layer, or at the very least, in the same process.

As we can see, the advantage of a distributed database is that it distributes data across different servers. This horizontal scaling ability, known as Scale Out, can address the performance and storage bottlenecks of single-instance databases.

Theoretically, the performance of a distributed database can achieve linear improvement with the expansion of the compute and storage layers.

From a availability perspective, if the storage layer experiences a failure, it will only impact 1/N of the data, where N depends on the number of servers the data is distributed to. Therefore, the availability of a distributed database is significantly higher compared to a single-instance database. Achieving 99.999% availability with a single-instance database may be difficult, but with a distributed database, it is much easier.

Of course, distributed databases also have their drawbacks. Due to data distribution, distributed databases introduce new problems such as implementing auto-increment, index design, and distributed transactions (which will be discussed specifically in later content). Next, let’s take a look at the overall architecture of a distributed MySQL database.

Distributed MySQL Architecture #

Before studying the architecture of distributed MySQL, let’s first look at the architecture of single-machine MySQL.

Drawing 1.png As you can see, in the original architecture, the client accessed the MySQL database through the MySQL communication protocol, and the MySQL database would use high availability technology to perform replication and switch over in the event of a crash.

So, for the architecture of distributed MySQL databases, the overall architecture is shown in the following figure:

Drawing 2.png

From the above diagram, you can see that the data is distributed and stored in various MySQL instances below, and each piece of data is called a “shard”.

In the distributed MySQL architecture, the client no longer accesses the MySQL database itself but instead accesses a distributed middleware.

This distributed middleware still uses the MySQL communication protocol (because that’s how the client originally accessed MySQL) to communicate. Based on metadata information, the distributed middleware will automatically route user requests to the corresponding MySQL shard below, thereby accessing the specified node.

Furthermore, every layer of the distributed MySQL architecture needs to be highly available to ensure the high availability of the distributed database architecture.

For the upper layer distributed middleware, it can be horizontally scaled: that is, users can access multiple distributed middleware instances, and if one instance crashes, it can be removed directly.

Because the distributed middleware is stateless and the data is stored in the metadata service, its high availability design is relatively easy.

For the metadata, although it has a small amount of data, the data is crucial, and any crash could cause the middleware to fail, so the metadata needs to be protected by replication technology to ensure high availability.

Finally, each shard storage itself has replicas, which ensure the availability of the shard through the high availability technology we studied earlier. In other words, if the MySQL server of shard 1 crashes, the slave server of shard 1 will take over as the new master server and continue to provide services.

However, due to the use of distributed architecture, even if shard 1 crashes and takes 60 seconds to recover, this downtime only affects 1/N of the data requests for business access.

Overall, the distributed MySQL architecture achieves the separation of the computing layer and the storage layer, and each layer can scale out in parallel. Each layer ensures the continuity of the computing layer and the storage layer through high availability technology, greatly improving the performance and reliability of the MySQL database and laying a solid foundation for serving massive internet businesses.

Summary #

In today’s lecture, we mainly studied the concept of distributed databases and understood that distributed databases distribute data to different nodes to improve performance and reliability. In addition, to implement a distributed database, the computing layer and the storage layer of the database need to be separated.

Next, we learned about the architecture of distributed MySQL databases: the distributed MySQL architecture uses a middleware routing layer to shield the underlying MySQL shards.

Since the distributed middleware communicates using the MySQL protocol, developers can use the distributed middleware just like they used MySQL. For developers, all of this is transparent, and they don’t need to worry about how many shards there are in the lower layer. All routing and computation work is handled by the middleware layer.

This lecture is relatively simple and serves as the foundation for the next few lectures. I suggest you read it repeatedly and try to understand the architecture of distributed MySQL databases.