01 From Theory to Practice How to Make Database Sharding and Table Splitting Truly Effective


This lesson mainly discusses how to implement sharding and partitioning in real life.

In internet system development, sharding and partitioning is not a new concept. Many developers have some understanding of it and know roughly when it applies, but far fewer are clear on how to put it into practice. Both its meaning and its implementation are considerably more complicated than the name suggests. This leads to the core topic of this lesson: how to truly implement sharding and partitioning.

Starting from the Evolution of Data Storage and Access #

To answer the question “how to implement sharding and partitioning in real life?”, let’s start with a typical case: imagine that in an e-commerce system, there is an order table. During the initial operation of the system, data is usually stored and accessed using a single database and a single table. Because the data volume is small, the bottleneck of database access is not obvious.

As the business evolves, when the system needs to support large-scale e-commerce operations, it may generate hundreds of thousands or even millions of orders per day, and the access to the order table will start to bottleneck.

Take MySQL, a relational database commonly used in internet systems, as an example. Although a single MySQL table can theoretically store billions of rows, access performance is poor at that scale, and even with various optimization strategies the improvement is usually minimal. A widely cited rule of thumb in the industry is that a single MySQL table performs best when it holds fewer than about 10 million rows. Once it exceeds this scale, other solutions need to be considered.

Since a single table in relational databases like MySQL cannot support storing and accessing large amounts of data, naturally, you may think of using NoSQL databases like MongoDB to manage the data.

However, this is not a very good option, for two reasons. On the one hand, the relational ecosystem is very mature: relational databases have been developed continuously for decades and offer a level of stability and reliability that NoSQL databases find hard to match. On the other hand, transactions are a core feature of relational databases that most other data storage tools lack or support only partially. Currently, the vast majority of companies store their core data in relational databases, and for internet companies MySQL is the mainstream choice.

Now that we have chosen a relational database, we can consider using the sharding and partitioning solution to solve the bottleneck problem of a single database table. This is the common method used by the internet industry to handle massive amounts of data. The sharding and partitioning solution is more of a supplement to the data storage and access mechanism of relational databases, rather than a complete overhaul. So what exactly is sharding and partitioning?

What is Data Sharding and Partitioning? #

Splitting databases (sharding) and splitting tables (partitioning) are two separate concepts, but they are usually mentioned together as sharding and partitioning. There is no unified definition in the industry, but you can simply understand it as:

In order to solve the performance deterioration problem caused by the large volume of data, the original independent database is split into multiple databases, and the large tables are split into multiple smaller tables, so that the data volume of a single database and a single table becomes small enough to improve the performance of the database.

Sharding and partitioning can take many forms. Let us look at them in turn.

Forms of Sharding and Partitioning #

Sharding and partitioning involve two dimensions: database sharding and table sharding. In the development process, both dimensions can adopt two types of splitting strategies: vertical splitting and horizontal splitting:

Let’s first discuss the application of vertical splitting. Compared to horizontal splitting, vertical splitting is relatively easier to understand and implement. In an e-commerce system, when a user opens the homepage, basic data such as user gender and geographical location are often loaded. For the user table, the access frequency of this basic data on the homepage is obviously higher than that of user avatars and other data. Based on the different access characteristics of these two types of data, the user table can be split, storing the low-frequency accessed user avatars and other information in a separate table, and storing the high-frequency accessed user information in another table:

As this shows, vertical table splitting divides one table into several tables by column, with each table storing a subset of the fields. In practice, we usually place large BLOB-type fields such as avatars, along with rarely accessed data, in a separate table, and keep columns that are frequently queried together in one table. This is one form of table splitting.
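The split described above can be sketched as follows. This is a minimal illustration that uses an in-memory SQLite database in place of MySQL; the table names `user_base` and `user_profile` and their columns are assumptions made up for the example, not names from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hot columns that are read on every homepage load.
    CREATE TABLE user_base (
        user_id  INTEGER PRIMARY KEY,
        gender   TEXT,
        location TEXT
    );
    -- Cold, bulky columns kept apart; both tables share the same key.
    CREATE TABLE user_profile (
        user_id INTEGER PRIMARY KEY REFERENCES user_base(user_id),
        avatar  BLOB
    );
""")

conn.execute("INSERT INTO user_base VALUES (?, ?, ?)", (1, "F", "Hangzhou"))
conn.execute("INSERT INTO user_profile VALUES (?, ?)", (1, b"\x89PNG..."))

# The homepage query touches only the narrow, frequently accessed table.
homepage_row = conn.execute(
    "SELECT gender, location FROM user_base WHERE user_id = ?", (1,)
).fetchone()

# The avatar is fetched only when a page actually needs it.
avatar = conn.execute(
    "SELECT avatar FROM user_profile WHERE user_id = ?", (1,)
).fetchone()[0]
```

Both tables keep `user_id` as the primary key, so the full user record can still be reassembled with a join when needed.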

Vertical table splitting can deliver a certain performance improvement, but the data still lives in the same database, so operations remain confined to one server and the tables still compete for the CPU, memory, and network IO of that server. For this reason, after splitting tables vertically, we can go a step further and split databases vertically.

In the scenario above, even after the user table has been split, user information is still stored on the same server as product and order information. Following the idea of vertical database splitting, user-related data can be moved out into a database of its own.

This is the effect of vertical database splitting. By definition, it means classifying tables by business domain and distributing them to different databases, each of which can sit on a different server; the core idea is a dedicated database for each business. In practice, vertical database splitting depends heavily on business planning and on where system boundaries are drawn. For example, separating out user data requires considering how the user system relates to the other business modules; it is not simply a matter of creating a user database. In high-concurrency scenarios, vertical database splitting can, to a certain extent, improve IO throughput, raise the number of available database connections, and ease the hardware bottleneck of a single machine.
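To make the idea concrete: with vertical database splitting, routing is decided by which table is being accessed, not by which row. The sketch below assumes a hypothetical mapping from tables to databases; the names `user_db`, `product_db`, and `order_db` are illustrative only.

```python
# Hypothetical assignment of tables to business-specific databases.
TABLE_TO_DATABASE = {
    "user": "user_db",
    "user_profile": "user_db",
    "product": "product_db",
    "order": "order_db",
    "order_item": "order_db",
}


def database_for(table: str) -> str:
    """Return the database that owns a table after vertical database splitting."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        raise ValueError(f"no database configured for table {table!r}") from None
```

Tables that are joined frequently, such as `order` and `order_item` here, are kept in the same database so that the join does not cross servers.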

From the analysis so far, we can see that although vertical splitting is relatively easy to implement, it does not solve the core problem: a single table holding too much data. In practice, therefore, we often need to add horizontal splitting on top of vertical splitting. For example, we can take the user ID modulo the number of databases and store user information in different databases accordingly. This is a common approach to horizontal database splitting:

As can be seen, horizontal database splitting distributes the rows of the same table across different databases according to certain rules, and each database can be located on a different server. This approach can often solve the storage capacity and performance bottlenecks of a single database. However, because the same table is spread over several databases, data access requires additional routing work, which greatly increases system complexity. The rules mentioned here are in practice a set of algorithms, including:

Modulo algorithm: there are many ways to take a modulus, such as the modulus of the user ID mentioned above, or a modulus computed from one or more columns of the table;
Range algorithm: range-based rules are also very common, for example routing to the target database or table by year or by time period;
Predefined algorithm: the number of databases or tables is planned in advance, and each request is routed directly to a specified database or table.
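The three kinds of rules listed above can be sketched as small routing functions. All database names, the shard count of four, and the tenant keys below are assumptions chosen for illustration; the predefined rule is shown as an explicit lookup table, which is only one possible reading of it.

```python
from datetime import date

# Assumed set of physical databases for the modulo rule.
USER_DATABASES = ["user_db_0", "user_db_1", "user_db_2", "user_db_3"]


def route_by_modulo(user_id: int) -> str:
    """Modulo rule: the remainder of the user ID selects the database."""
    return USER_DATABASES[user_id % len(USER_DATABASES)]


def route_by_range(order_date: date) -> str:
    """Range rule: every order of a given year lands in the same database."""
    return f"order_db_{order_date.year}"


# Predefined rule: targets are planned in advance and looked up directly.
PREDEFINED_TARGETS = {"vip": "user_db_vip", "internal": "user_db_internal"}


def route_predefined(tenant: str) -> str:
    """Predefined rule: route straight to the database assigned to a tenant."""
    return PREDEFINED_TARGETS[tenant]
```

Modulo routing spreads rows evenly, but changing the number of databases remaps most keys. Range routing keeps related rows together and makes archiving easy, at the risk of concentrating current traffic on the newest database.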

Following the same idea, the user table inside the user database can also be split horizontally, as shown in the following figure. In other words, horizontal table splitting distributes the rows of one table across multiple tables within the same database according to certain rules.
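A minimal sketch of horizontal table splitting inside a single database follows. It again uses in-memory SQLite as a stand-in for MySQL, and the table count of four and the `user_N` naming scheme are assumptions for the example.

```python
import sqlite3

TABLE_COUNT = 4  # assumed number of physical tables


def table_for(user_id: int) -> str:
    """Pick the physical table for a user by taking the ID modulo the table count."""
    return f"user_{user_id % TABLE_COUNT}"


conn = sqlite3.connect(":memory:")
for i in range(TABLE_COUNT):
    # Every physical table has the same schema as the original user table.
    conn.execute(f"CREATE TABLE user_{i} (user_id INTEGER PRIMARY KEY, name TEXT)")


def insert_user(user_id: int, name: str) -> None:
    conn.execute(f"INSERT INTO {table_for(user_id)} VALUES (?, ?)", (user_id, name))


def find_user(user_id: int):
    return conn.execute(
        f"SELECT user_id, name FROM {table_for(user_id)} WHERE user_id = ?",
        (user_id,),
    ).fetchone()


for uid, name in [(1, "alice"), (2, "bob"), (5, "carol")]:
    insert_user(uid, name)
```

Here users 1 and 5 both land in `user_1`, while user 2 lands in `user_2`. A lookup by `user_id` touches exactly one table, whereas a query that does not include the sharding key would have to scan all four.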

Obviously, the data storage architecture of the system has become very complex. Compared with the single-database, single-table structure before sharding, it now faces a series of challenging problems, such as:

How to efficiently govern multiple databases?
How to perform cross-node association queries?
How to implement cross-node pagination and sorting operations?
How to generate globally unique primary keys?
How to ensure transaction consistency?
How to migrate data?
…

Without a good tool to support data storage and access, data consistency is difficult to guarantee. This is where sharding middleware such as ShardingSphere proves its value.
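To give a feel for why these problems are hard, consider cross-node pagination and sorting from the list above. The sketch below shows one naive approach, not how any particular middleware does it: each shard returns its own sorted prefix, and the results are merged before the page is cut. The sample rows are made up, and each shard is modelled as a plain list of `(order_id, label)` tuples.

```python
import heapq
from itertools import islice


def page_from_shards(shards, offset, limit):
    """Return rows [offset, offset + limit) of the globally sorted result."""
    # Each shard must supply its first offset + limit rows, because the
    # rows of the requested page may all live on a single shard.
    per_shard = [sorted(rows)[: offset + limit] for rows in shards]
    # Merge the already sorted per-shard lists into one sorted stream.
    merged = heapq.merge(*per_shard)
    return list(islice(merged, offset, offset + limit))


# Hypothetical contents of three shards, keyed by order ID.
shards = [
    [(3, "c"), (1, "a"), (8, "h")],
    [(2, "b"), (5, "e")],
    [(4, "d"), (6, "f"), (7, "g")],
]

page = page_from_shards(shards, offset=2, limit=3)
```

Note the cost: to serve one page, every shard has to return `offset + limit` rows, so deep pages become progressively more expensive as the offset grows.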

Sharding and Read/Write Separation #

When it comes to sharding, we have to introduce another technology system that solves data access bottlenecks: read/write separation, which is related to database master-slave architecture. We know that databases like MySQL provide a complete master-slave architecture, which ensures data synchronization between the master database and the slave databases. Based on the master-slave architecture, we can separate read and write operations according to operational requirements, thereby improving access efficiency. The basic principle of read/write separation is as follows:

As shown in the diagram, there is a master database and a slave database in the database cluster, and the data between the two is synchronized through a synchronization mechanism. In Internet systems, it is generally believed that the frequency of read operations on databases is much higher than that of write operations, so the bottleneck often occurs in read operations. By separating read operations and performing them on independent slave databases, read/write separation can be achieved. In practical master-slave architectures, the number of master and slave databases, especially the number of slave databases, can be expanded based on the size of the data.
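The routing decision behind read/write separation can be sketched as below. This is a simplified illustration under stated assumptions: databases are represented by plain names, reads are detected by checking whether the statement starts with `SELECT`, and slaves are chosen round-robin. The class name `ReadWriteRouter` is hypothetical.

```python
import itertools
from typing import Sequence


class ReadWriteRouter:
    """Send writes to the master and spread reads over the slaves."""

    def __init__(self, master: str, slaves: Sequence[str]):
        self.master = master
        # Fall back to the master when no slave is configured.
        self._slaves = itertools.cycle(list(slaves) or [master])

    def route(self, sql: str) -> str:
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        # Simplification: every SELECT counts as a read. Statements such as
        # SELECT ... FOR UPDATE would need to go to the master in practice.
        if verb == "SELECT":
            return next(self._slaves)
        return self.master


router = ReadWriteRouter("master", ["slave_0", "slave_1"])
```

With this router, `router.route("INSERT INTO t_order VALUES (1)")` returns the master, while consecutive `SELECT` statements alternate between the two slaves. Because synchronization is not instantaneous, a read that must see a write made just before it may still need to be sent to the master.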

Read/write separation is a commonly used way to handle database access under high concurrency. However, like upgrading the server configuration, it is not the ultimate solution. The ultimate solution is still the database and table splitting described earlier, in which databases or tables are split by rules such as user ID. Note, however, that the two are not mutually exclusive; they complement each other. It is entirely possible to add read/write separation on top of sharding:

In fact, ShardingSphere, which this course introduces, implements the architecture shown in the diagram, supporting read/write separation alongside sharding. This will be explained in subsequent lessons.

Sharding Solutions and Representative Frameworks #

Based on the previous discussion of splitting databases and tables, we can abstract the core concept behind both: sharding. Whether we split by database or by table, the data is partitioned into shards that are stored in different target objects. How the sharding is actually carried out depends on the solution chosen.

In fact, there are many frameworks in the industry that deal with sharding, and these frameworks obviously do not adopt the same solution. However, by analyzing the differences in implementing data sharding schemes in these frameworks, they can be divided into three major types: client-side sharding, proxy server sharding, and distributed databases.

Client-side Sharding #

Client-side sharding means that sharding rules are implemented directly in the database client. This approach moves the sharding work forward to the client side, which manages and maintains all sharding logic and determines the target database and table for each SQL statement it executes.

Client-side sharding solutions come in different forms, with the simplest form being application-level sharding, which means that sharding rules and logic are maintained directly in the application program:

In terms of implementation, we usually package the sharding rule processing logic into a common JAR package, and other developers only need to include this JAR package in their code projects. For this solution, because there is no dedicated server component, there is also no need to maintain a specific middleware. However, embedding the sharding component directly in the business code has obvious drawbacks:

On the one hand, because the sharding logic is embedded in the business code, business developers need to understand both the business logic and the way sharding rules are processed, increasing development and maintenance costs;
On the other hand, once a problem occurs, only business developers can analyze it by examining the code; the work cannot be separated out and handed to a dedicated middleware team.

Based on the above analysis, client-side sharding is usually abstracted further so that the management of sharding rules is separated from the business code and becomes a system that evolves on its own. A typical design is to reimplement the JDBC specification (often described as rewriting the JDBC protocol), embedding the sharding rules at the JDBC level. Business developers can then keep using a set of APIs that are fully compatible with the JDBC specification, while sharding is performed automatically behind those APIs, achieving zero intrusion into the business code:

Client-side sharding architecture: rewriting the JDBC specification

The advantage of this solution is that sharding operations are completely transparent to the business, to some extent separating the responsibilities of business developers and the database middleware team. In this way, business developers only need to understand the JDBC specification to achieve database sharding, reducing the difficulty of development and the cost of code maintenance.
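The idea of hiding sharding behind a familiar database API can be sketched in a few lines. The example below is only an analogy in Python: it wraps a SQLite connection instead of implementing JDBC, the class name `ShardingConnection` is hypothetical, and the SQL rewriting is a naive string replacement. Real middleware parses the SQL to find the sharding value; this sketch simply reads it from the named parameters.

```python
import sqlite3


class ShardingConnection:
    """Looks like an ordinary connection but routes to physical tables."""

    def __init__(self, conn, logical_table, sharding_column, table_count):
        self._conn = conn
        self._logical = logical_table
        self._column = sharding_column
        self._count = table_count

    def execute(self, sql, params):
        # Work out the physical table from the sharding value, then rewrite
        # the logical table name before delegating to the real connection.
        suffix = params[self._column] % self._count
        physical_sql = sql.replace(self._logical, f"{self._logical}_{suffix}")
        return self._conn.execute(physical_sql, params)


raw = sqlite3.connect(":memory:")
for i in range(2):
    raw.execute(f"CREATE TABLE t_order_{i} (order_id INTEGER, user_id INTEGER)")

# Business code only ever sees the logical table name t_order.
conn = ShardingConnection(raw, "t_order", "user_id", 2)
conn.execute(
    "INSERT INTO t_order (order_id, user_id) VALUES (:order_id, :user_id)",
    {"order_id": 1001, "user_id": 7},
)
rows = conn.execute(
    "SELECT order_id FROM t_order WHERE user_id = :user_id", {"user_id": 7}
).fetchall()
```

The business code writes SQL against `t_order` and never mentions `t_order_0` or `t_order_1`, which is the kind of transparency the paragraph above describes.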

For client-side sharding, typical middleware includes Alibaba’s TDDL and ShardingSphere, which will be introduced in this course. However, since TDDL is not open source, it is not possible to determine the specific implementation of client-side sharding. As for ShardingSphere, it is a typical implementation framework that rewrites the JDBC specification to achieve client-side sharding.

Proxy Server Sharding #

The idea behind proxy server sharding is equally clear: add a proxy layer between the application layer and the database layer. Sharding rules are then maintained centrally in the proxy layer, which exposes a JDBC-compatible API to the application layer. Business developers in the application layer no longer need to care about the specific sharding rules and can focus on implementing business logic:

The advantage of proxy server sharding is that it frees business developers from managing sharding rules. The drawback is the extra proxy layer, which brings the usual problems of a proxy mechanism, such as the performance cost of an additional network hop.

For proxy server sharding, common open-source frameworks include Alibaba’s Cobar and the community-driven MyCat. ShardingSphere 3.x also added a Sharding-Proxy module to provide proxy server sharding.

Distributed Database #

In the process of technological development and evolution, one of the major problems with relational databases is the lack of distributed characteristics, meaning the lack of effective data processing mechanisms for dealing with large data volumes and high-concurrency access in distributed environments. For example, we know that transactions are one of the essential features of relational databases, but in a distributed environment, implementing transactions based on traditional relational databases like MySQL faces enormous challenges.

Fortunately, the rise of distributed databases represented by TiDB has given relational databases a certain degree of distributed characteristics. In these distributed databases, data sharding and distributed transactions are built-in basic features that are transparent to business developers. Business developers only need to use the JDBC interface provided by the framework, just like using traditional relational databases like MySQL.

From this perspective, we can also consider ShardingSphere as a middleware for distributed databases. In addition to providing standardized data sharding solutions, it also implements distributed transactions and database governance functions.

Summary #

Conceptually, the basic principles and forms of database sharding are not difficult to understand, but implementing them is not so simple. For this reason, the industry has produced a number of representative solutions, including client-side sharding, proxy server sharding, and distributed databases. These solutions approach the goal of database sharding from different angles, and in daily development we can choose one or more of them. ShardingSphere combines client-side sharding, a proxy mechanism, and distributed transaction support, and developers can introduce these features as needed.

Here’s a question for you to think about: What are the specific ways to combine database sharding, table sharding, and read/write separation?

That’s all for this lesson. In the next lesson, we will comprehensively introduce the ShardingSphere framework and see what kind of Apache open-source software it is.