
09 Database Optimization Solution II - How to Realize Database Sharding When Data Writing Increases #

Hello, I’m Tang Yang.

In the previous lesson, we learned about one optimization solution for databases under high concurrency: read-write separation. It uses master-slave replication to keep multiple copies of the data, which strengthens the database’s ability to handle a large number of concurrent read requests, improves query performance, and also increases data safety: if any node, master or slave, fails, the other nodes still hold a full copy of the data, so nothing is lost. At this point, the architecture of your e-commerce system looks like this:

img

At this point, the company’s CEO suddenly announces good news: continuous promotion keeps bringing in traffic, and the order volume of your e-commerce system has exceeded 50 million. All of this order data sits in a single table, which multiplies your workload: both query performance and write performance have dropped, and the database disk space is running low. So you proactively analyze what needs to be considered at this stage and look for efficient solutions to keep the system running normally. The main issues you consider are as follows:

  1. The system keeps growing, with more registered users and more orders, so the amount of data stored in the database keeps growing too. A single table now holds tens of millions, or even hundreds of millions, of rows. Even with indexes, the index space grows with the data volume, and the database can no longer cache the full index in memory, so index data has to be read from disk, which hurts query performance. How do you improve query performance in this scenario?

  2. The increase in data volume also occupies disk space, making database backup and recovery times longer. How can you make the database system support such a large amount of data?

  3. Different module data, such as user data and user relationship data, are all stored in a single master database. Once the master database fails, all modules will be affected. How can you achieve fault isolation for different modules?

  4. You already know that when performing a benchmark on MySQL 5.7 on a 4-core, 8GB cloud server, it can support approximately 500 TPS and 10,000 QPS. You can see that the database’s write performance is weaker than its query performance. So, as the system’s write request volume increases, how does the database system handle higher concurrent write requests?

These problems can be summarized as performance and availability issues caused by a large influx of database write requests. To solve them, you need to shard the data, which effectively spreads the database’s read and write pressure and overcomes the storage bottleneck of a single machine. A common way to do this is “sharding and partitioning”: splitting the data across multiple databases and tables.

Sharding and partitioning is a very common technical solution that you should be familiar with. Now, you might ask, “Since this technology is so common and I am already familiar with it, why are you bringing up this topic?” Based on my past experience, many people tend to make mistakes when it comes to “sharding and partitioning,” mainly in the following areas:

  1. They only partially understand how to use sharding and partitioning correctly, without fully grasping the applicable scenarios and methods. For example, some students write queries that do not use the partition key.

  2. After introducing sharding and partitioning, they fail to find suitable solutions to the problems it creates. For example, they continue to rely heavily on cross-table join queries.

In this lesson, I will help you solve these two problems and move beyond the common pitfalls.

How to Perform Vertical Database Sharding #

Database sharding is a common method to split data into smaller portions. The basic idea is to distribute data evenly across multiple database nodes or tables according to a certain strategy.

Unlike full data replication in master-slave replication, with database sharding, each node only stores a portion of the data. This effectively reduces the amount of data stored in a single database node or table, addressing the storage bottleneck and improving query performance. Furthermore, since data is distributed across multiple database nodes, write requests are no longer directed to a single master database but to multiple shard nodes, thereby improving concurrent write performance to some extent.

For example, I worked on a live streaming project in which I needed to store user messages in chat rooms as well as system messages within the live rooms. As you can imagine, the volume of these messages was enormous. It was common to have tens of thousands of comments in popular chat rooms, accumulating billions of data records over time. This led to performance and storage issues that couldn’t be managed by a single database. As a result, I had to work overtime to refactor the system, launching multiple databases to distribute the write and storage load. I also had to migrate the data from the original single database to the newly launched database nodes. Thankfully, I successfully completed the task of database sharding and data migration, but it did require a significant amount of time and effort.

There are two ways to perform database sharding: vertical sharding and horizontal sharding. Both matter, but what matters most is understanding the principle behind each split, so as you study them, it is best to think about how they would apply to your own business scenario.

Vertical sharding, as the name suggests, involves splitting the database vertically, meaning dividing the database tables across multiple different databases.

The principle behind vertical sharding is generally based on business type. The core idea is to specialize databases for specific purposes, splitting highly coupled tables into separate databases. An illustrative example would be organizing clothes, where down jackets, sweaters, and T-shirts are stored in different compartments. This approach addresses the third issue I mentioned at the beginning: splitting data from different businesses across different database nodes. Therefore, if a database failure occurs, it only affects a specific module’s functionality instead of the entire system, achieving fault isolation at the data level.

Let’s use the example of a Weibo (Microblogging) system to illustrate.

In a Weibo system, there are tables related to users, content, and relationships, all of which are initially stored in the main database. After sharding, we would expect the tables related to users to be split into a user database, the tables related to content to be split into a content database, and the tables related to relationships to be split into a relationship database.

img
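In application code, vertical sharding usually just means configuring one data source per business database and routing each module’s tables to its own database. Here is a minimal sketch of that idea in Python; the hosts, database names, and table-to-module mapping are illustrative assumptions, not a real configuration.

```python
# One connection configuration per business database after vertical sharding.
# Hosts and database names below are illustrative assumptions.
DATASOURCES = {
    "user":     {"host": "user-db.internal",     "db": "weibo_user"},
    "content":  {"host": "content-db.internal",  "db": "weibo_content"},
    "relation": {"host": "relation-db.internal", "db": "weibo_relation"},
}

# Illustrative mapping from table name to business module.
TABLE_MODULE = {
    "user": "user", "user_profile": "user",
    "feed": "content", "comment": "content",
    "follow": "relation", "follower": "relation",
}

def datasource_for(table: str) -> dict:
    """Route a table to its business database; a failure in one database
    only affects that module, which is the fault isolation described above."""
    return DATASOURCES[TABLE_MODULE[table]]
```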

Performing vertical database sharding is a relatively conventional approach, and you will likely use it quite often. However, even after sharding, it does not guarantee that all issues will be resolved. Vertical sharding cannot handle situations where data volume in a specific business module significantly expands. In such cases, you will need to find other ways to address the issue.

For example, if the number of relationships in Weibo exceeds billions, a single database or table will no longer meet storage and querying requirements. In this case, you would need to further split the data across multiple databases and tables, which means performing horizontal database sharding.

How to Perform Horizontal Database Sharding #

Different from vertical splitting, where the focus is on business relevance, horizontal splitting refers to splitting a single data table into multiple databases and multiple data tables according to certain rules, focusing on the characteristics of the data.

There are two common rules for splitting:

  1. Split according to the hash value of a field. This rule is better suited to entity tables, such as user tables and content tables. Generally we split these entity tables by the ID field. For example, to split the user table into 16 databases and 64 tables, we first hash the user ID; the purpose of hashing is to spread the IDs as evenly as possible. We then take the hash modulo 16 to get the database index and modulo 64 to get the table index (a code sketch of both routing rules follows this list).

img

  2. Another commonly used method is to split based on the range of a field, often a time field. The content table, for example, has a “creation time” field, and we often want to view the content someone posted during a given period. In this case we can split the databases and tables by the range of the creation time, for instance storing one month of data per table. When querying, we first locate which table holds the data based on the creation time, and then apply the remaining query conditions.

Generally, this kind of split suits list-style data, such as orders or posts within a certain period of time. However, it can produce obvious hotspots: you are more likely to care about what you bought or posted recently, so the most recent tables receive much higher query QPS, which can hurt performance. Also, with this rule the tables must be created in advance; if it is already New Year’s Day 2020 and the database administrator forgot to create the 2020 tables, there is nowhere to write the 2020 data, and that causes a failure.

img
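To make the two routing rules above concrete, here is a minimal sketch in Python. The shard counts (16 databases, 64 tables), the choice of MD5 as the hash, and the table-naming scheme are all illustrative assumptions rather than a fixed recipe.

```python
import hashlib
from datetime import datetime

DB_COUNT = 16     # assumed number of databases
TABLE_COUNT = 64  # assumed number of tables per database

def hash_route(user_id: int) -> tuple:
    """Hash-based routing: hash the ID to spread values evenly, then take the
    result modulo 16 for the database index and modulo 64 for the table index."""
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return h % DB_COUNT, h % TABLE_COUNT

def range_route(create_time: datetime) -> str:
    """Range-based routing: one content table per month of the creation time."""
    return f"content_{create_time.year}_{create_time.month:02d}"

# Usage: route a user record to its shard, and route a post made in
# July 2020 to the table that holds that month's data.
db_idx, table_idx = hash_route(123456789)
print(f"user_db_{db_idx}.user_table_{table_idx}")
print(range_route(datetime(2020, 7, 15)))
```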

After the database is split, the way data is accessed also changes significantly. Previously, we only needed to query data from the slave database based on the query conditions. Now, we need to first confirm which database and table the data is in, and then query the data from that database and table. This complexity can also be solved by using a database middleware, as we have already explained in the previous section. However, I want to emphasize again that you need to have a sufficient understanding of the principles of the database middleware and a strong ability to manage and control its operation.

However, you should know that while splitting a database can solve the scalability issue, it also brings some problems in terms of usage.

Resolving issues introduced by sharding #

The first major issue sharding brings is the sharding key, also called the partition key, which is the field we use to split the data.

As you can see from the sharding rules, whether it’s hash-based sharding or range-based sharding, we first need to select a database field. This brings up a problem: all our queries from now on will need to include this field in order to find the data in the correct database and table. Otherwise, we would have to send query commands to all databases and tables. If, as mentioned earlier, we need to shard into 16 databases and 64 tables, a single data query would turn into 16 * 64 = 1024 queries, clearly leading to poor performance.
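To see why missing the partition key is so expensive, here is a rough sketch that reuses the `hash_route`, `DB_COUNT`, and `TABLE_COUNT` from the routing sketch above; `query_one_shard` is a hypothetical stand-in for issuing SQL against a single shard.

```python
def query_one_shard(db_idx: int, table_idx: int, where: str, *params):
    """Stand-in for running a query against user_db_{db_idx}.user_table_{table_idx}."""
    return []  # real code would send the SQL to that specific shard

user_id, nickname = 123456789, "alice"

# With the partition key: route first, then issue exactly one query.
db_idx, table_idx = hash_route(user_id)
rows = query_one_shard(db_idx, table_idx, "WHERE id = %s", user_id)

# Without the partition key: fan out to every shard and merge the results,
# which is 16 * 64 = 1024 queries for a single lookup.
rows = []
for db in range(DB_COUNT):
    for tbl in range(TABLE_COUNT):
        rows += query_one_shard(db, tbl, "WHERE nickname = %s", nickname)
```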

Of course, there are always more solutions than problems, and this one has its own. For example, our user database uses ID as the partition key. If we need to query users by nickname, one idea is to keep another copy of the data sharded by nickname, but that greatly increases storage costs. And if we later need to query by registration time as well, do we keep yet another copy sharded by registration time?

Therefore, the most appropriate approach is to build a mapping table from nicknames to IDs. When querying, you first look up the ID by nickname, and then use the ID to fetch the complete record. This mapping table can itself be sharded; it does take some storage space, but because it has only two fields, it still saves a significant amount of space compared to keeping a full copy of the data sharded by nickname.
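Here is a minimal sketch of that two-step lookup, reusing the `hash_route` helper above; the in-memory dicts stand in for the (sharded) mapping table and user tables, and all names are illustrative.

```python
# Stand-ins for the mapping table (nickname -> ID) and the sharded user tables;
# real code would issue SQL against the corresponding databases.
NICKNAME_TO_ID = {"alice": 123456789}
USER_ROWS = {123456789: {"id": 123456789, "nickname": "alice"}}

def get_user_by_nickname(nickname: str):
    """Two-step lookup: nickname -> ID via the mapping table, then ID -> full
    record via the partition key. The mapping table can itself be sharded by nickname."""
    user_id = NICKNAME_TO_ID.get(nickname)
    if user_id is None:
        return None
    db_idx, table_idx = hash_route(user_id)
    # In a real system this would query user_db_{db_idx}.user_table_{table_idx}.
    return USER_ROWS.get(user_id)
```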

Another problem introduced by sharding is that some database features may become difficult to implement. For example, when multiple tables are joined in a single database, it can be accomplished with a single SQL statement. However, after sharding across multiple databases, it is no longer possible to execute SQL statements across databases. Fortunately, our need for join operations is not high, and even if there is a need, it is usually done by extracting the data from both tables and performing filtering in the business logic code. Although it may be somewhat complex, it can still be achieved. Another example is that before sharding, counting the total number of records only required executing count() in the SQL statement. Now that the data is scattered across multiple database tables, we may need to consider other solutions, such as storing the count data separately in a table or recording it in Redis.
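Below is a hedged sketch of both workarounds: an application-level “join” that fetches from each table separately and merges in memory, and a separately maintained counter (here in Redis) instead of a cross-shard count(). The helper functions and key names are illustrative assumptions.

```python
import redis

# Stand-ins for per-shard queries; real code would query each shard separately.
def fetch_users(user_ids):
    return [{"id": uid, "nickname": f"user_{uid}"} for uid in user_ids]

def fetch_feeds_by_user_ids(user_ids):
    return [{"feed_id": 1, "user_id": user_ids[0], "text": "hello"}] if user_ids else []

def list_feeds_with_authors(user_ids):
    """Application-level 'join': pull rows from both tables, then merge in memory."""
    users = {u["id"]: u for u in fetch_users(user_ids)}
    feeds = fetch_feeds_by_user_ids(user_ids)
    return [{**feed, "author": users.get(feed["user_id"])} for feed in feeds]

# Instead of a cross-shard count(), keep the total in Redis (or a counter table)
# and update it on every write.
r = redis.Redis()  # assumes a reachable Redis instance

def on_feed_created():
    r.incr("feed:total_count")
```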

Of course, although sharding may bring some inconveniences to our use of databases, we still need to do it because of the scalability and performance improvements it brings. After all, only by going through sharding can we overcome the limitations of single-machine capacity and request volume bottlenecks. Just like the e-commerce system I mentioned in the beginning, it was through sharding that it was able to resolve the performance decline and capacity bottlenecks caused by the large amount of data in the order table.