36 When to consider table splitting and database sharding #

Hello, I’m Liu Chao.

In today’s Internet era, handling massive amounts of data is a common requirement for most mature products, especially in mobile Internet products where data is generated almost every day. For example, a shopping mall might generate millions of orders daily, a payment system might have millions of transaction records, and a game might produce battle reports, and so on.

For a shopping mall with millions of daily active users, the number of orders generated daily could be in the millions or even tens of millions during promotional periods.

If we implement this with a single table that accumulates hundreds of thousands or even millions of rows per day, the table's query performance will degrade severely within a month. This is because MySQL's InnoDB storage engine implements indexes as B+-trees, and the number of I/O operations a query needs largely depends on the height of the tree: as the B+-tree grows taller, each lookup requires more I/O and query performance deteriorates.
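To make the height argument concrete, here is a rough back-of-the-envelope estimate of how many rows fit in a B+-tree of a given height. The numbers (16 KB pages, ~14 bytes per internal entry, ~1 KB per row) are illustrative assumptions, not exact InnoDB internals:

```python
# Rough capacity estimate for an InnoDB-style B+-tree, using assumed
# (illustrative) numbers: 16 KB pages, a bigint key plus page pointer
# of ~14 bytes per internal entry, and ~1 KB per row on leaf pages.
PAGE_SIZE = 16 * 1024
FANOUT = PAGE_SIZE // 14           # entries per internal node (~1170)
ROWS_PER_LEAF = PAGE_SIZE // 1024  # rows per leaf page (~16)

def rows_at_height(height: int) -> int:
    """Approximate max rows a B+-tree of the given height can hold."""
    return FANOUT ** (height - 1) * ROWS_PER_LEAF

for h in (2, 3, 4):
    print(f"height {h}: ~{rows_at_height(h):,} rows")
```

Under these assumptions a tree of height 3 holds roughly twenty million rows, which is why a table gaining millions of rows a day forces the tree (and the per-query I/O count) to grow within weeks.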

When faced with a table containing a massive amount of data, we usually consider optimization solutions such as table partitioning, NoSQL storage, and table and database splitting.

Although table partitioning is based on the same splitting principle, it is still performed within a single database, so it offers limited headroom in scenarios that need higher concurrency. Moreover, a MySQL table supports at most 1024 partitions (raised to 8192 in newer MySQL versions), so its ability to absorb ever-growing massive data is also limited. However, for large tables that do not hold truly massive amounts of data, table partitioning is a reasonable way to optimize performance.

A partitioned table is implemented as multiple related underlying tables, each represented by a handler object, so we can also access each partition directly. The storage engine manages these underlying tables just as it manages ordinary tables (all underlying tables must use the same storage engine), and the indexes of a partitioned table are simply identical indexes defined on each underlying table. From the storage engine's perspective there is no difference between an underlying table and an ordinary table; it does not need to know whether it is handling an ordinary table or part of a partitioned table.

NoSQL storage, on the other hand, is based on key-value pairs. Although its query performance is very high, it has limitations: it is not relational, most NoSQL stores do not support transactions, and their stability is often considered weaker than that of an RDBMS. Even though some NoSQL databases have implemented transactions and advertise reliable stability, NoSQL today is primarily used as auxiliary storage.

When should we consider database and table sharding? #

Having analyzed table partitioning and NoSQL storage as optimization options, let's move on to the main topic of this lecture: database and table sharding.

In my opinion, we should avoid database and table sharding whenever possible. If the business runs smoothly on a single table, keep the single table. When we do hit a performance bottleneck, we should first consider table partitioning as the optimization, and only when partitioning cannot solve the problem should we turn to database and table sharding.

We know that with a single table in a single database, once the table accumulates a certain amount of data (say, 50 million rows or 100 GB), the performance of database operations noticeably decreases, and even index optimization or read-write separation cannot remove the limitation. At that point, if the daily data growth is significant, we should consider table sharding to prevent a single table from growing so large that it drags down database performance.

In the face of massive data, apart from the poor performance of a single table, other resources such as database connection numbers, disk I/O, network throughput, and concurrency are also limited. Therefore, in some scenarios where there is a large amount of data and high concurrency, we need to consider sharding and partitioning to enhance the concurrent processing capabilities of the database and improve the overall performance of the application.

How to do table and database sharding? #

Usually, table and database sharding can be divided into vertical sharding and horizontal sharding.

Vertical database sharding means splitting databases by business: different businesses use different databases. For example, in a flash-sale scenario both orders and coupons face high concurrency; if they share one database they compete for the same limited connections, so we can split them into an order database and a promotions database.

Vertical table sharding, on the other hand, splits one table into two based on its fields, typically by moving infrequently used fields into a separate table. For example, if an order details table has over a hundred fields, it is clearly too wide: it is inconvenient to develop and maintain, and such wide rows may span multiple data pages (the cross-page problem), hurting query performance. Splitting the table's fields solves both problems.

Horizontal table sharding splits one table into several smaller tables based on a chosen column and a rule such as a range or a hash modulo.
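The two routing rules can be sketched in a few lines. This is an illustrative sketch, not any particular middleware's API; the table names and the table count of 4 are assumptions for the example:

```python
# Illustrative routing rules for horizontal table sharding.
# Hash-modulo spreads rows evenly; range routing keeps adjacent keys together.

TABLE_COUNT = 4  # assumed number of shard tables

def hash_shard(user_id: int) -> str:
    """Route by hash modulo: user's rows land in one of order_0 .. order_3."""
    return f"order_{user_id % TABLE_COUNT}"

def range_shard(order_id: int, range_size: int = 10_000_000) -> str:
    """Route by ID range: orders 0-9,999,999 go to order_0, and so on."""
    return f"order_{order_id // range_size}"

print(hash_shard(10086))        # order_2
print(range_shard(25_000_000))  # order_2
```

Hash routing balances write load but scatters range scans; range routing keeps ranges local but can create hot spots in the newest table.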

Horizontal sharding only occurs within a single database. If there are bottlenecks in terms of connection count, I/O read/write, or network throughput, we need to consider distributing the horizontally sharded tables to different databases on different machines. This is known as horizontal database and table sharding.

Combining vertical and horizontal sharding, we can roughly order the setups as: one database with one table, one database with multiple tables, and multiple databases with multiple tables. In normal business development we should prioritize one database with one table. If the data volume is large but hot data is heavily concentrated and historical data is rarely accessed, we can consider table partitioning. If access to hot data is dispersed and almost all data gets accessed, we can consider one database with multiple tables. If we face high concurrency, massive data, and a huge daily increase in data volume, we can consider multiple databases with multiple tables.

It is worth noting that we should avoid sharding as much as possible. Once we shard the tables, we may encounter problems such as pagination queries across multiple tables and JOIN queries across multiple tables, which would increase the complexity of the business. Furthermore, once the databases are sharded, apart from cross-database pagination queries and cross-database JOIN queries, there will also be issues with cross-database transactions. These problems undoubtedly increase the complexity of our system development.

Problems Faced After Database and Table Sharding #

Despite the problems they bring, database and table sharding remain the most commonly used optimization for businesses with massive data and high concurrency. We should therefore think through the problems we will face after sharding; let's now look at the strategies for dealing with them.

To better understand these problems, we will analyze them by applying sharding and database partitioning to an order table with detailed business scenarios.

Suppose we have an order table and an order details table, with a daily data growth rate of 600,000 orders. There are also some promotional activities that increase the number of orders to millions. In order to improve the system’s concurrency capability, we consider sharding and partitioning the order table and order details table. In addition to sharding, because users generally query the most recent order information, the hot data is concentrated. We can also consider using table partitioning to optimize the single table queries.

Order sharding is usually implemented by hashing either the order number or the user ID. Hashing the order number distributes data evenly across tables, but the drawback is that when a user wants to query all of their orders, multiple tables must be queried.

Since most queries on the order table come from users, we should hash on the user ID field for horizontal table sharding. If we also need more order-processing capacity under high concurrency, we can go a step further and shard both databases and tables by the user ID hash. This is how most companies shard their order tables.

1. Distributed Transaction Issues #

When placing an order, besides creating the order we also need to deduct the corresponding inventory. Because the order table and the inventory table have been vertically split into different databases, we need distributed transactions to guarantee transactional integrity when the order is submitted.

Usually, we solve distributed transactions using two common methods: Two-Phase Commit (2PC) and TCC (Try-Confirm-Cancel) compensation-based transaction. I will provide a detailed explanation of distributed transactions in Lesson 41.

Common middleware already encapsulates these two approaches. For example, Spring supports JTA (Java Transaction API), and Alibaba's open-source distributed transaction middleware Fescar (since renamed Seata) integrates well with Dubbo.
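The shape of the TCC approach can be sketched as follows. This is a minimal illustration of the Try-Confirm-Cancel pattern for "create order + deduct stock"; the class and method names are made up for the sketch and do not belong to any real framework:

```python
# Minimal TCC (Try-Confirm-Cancel) sketch for "create order + deduct stock".
# All names here are illustrative, not a real framework's API.

class InventoryTCC:
    def __init__(self, stock: int):
        self.stock = stock
        self.frozen = 0

    def try_reserve(self, n: int) -> bool:
        """Try phase: freeze resources instead of deducting them."""
        if self.stock - self.frozen < n:
            return False
        self.frozen += n
        return True

    def confirm(self, n: int) -> None:
        """Confirm phase: the business call succeeded, deduct for real."""
        self.frozen -= n
        self.stock -= n

    def cancel(self, n: int) -> None:
        """Cancel phase: compensate by releasing what Try froze."""
        self.frozen -= n

inv = InventoryTCC(stock=10)
if inv.try_reserve(3):
    try:
        # ... create the order in the order database here ...
        inv.confirm(3)
    except Exception:
        inv.cancel(3)
print(inv.stock)  # 7
```

Unlike 2PC, which holds database locks across both phases, TCC pushes the reserve/confirm/compensate logic into the business layer, trading implementation effort for shorter lock times.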

2. Cross-Node JOIN Query Issues #

When users query orders, we often need to join tables to retrieve product information, and the product information table may be in another database, which involves cross-database JOIN queries.

Usually, we optimize cross-database JOIN queries by duplicating tables or adding redundant fields. For basic tables like the product information table, we can duplicate a basic table in each sharded order database to avoid cross-database JOIN queries. For queries involving one or two fields, we can also redundantly store a small number of fields in a table to avoid JOIN queries and, consequently, cross-database JOIN queries.

3. Cross-Node Pagination Query Issues #

When users query all orders in their order list, we can quickly locate the order information via the hash of the user ID. Back-end operators, however, query the order table by payment time, and that data is distributed across different databases and tables, which creates the problem of cross-node pagination queries.

Usually, sharding middleware handles this by first querying a certain amount of data from each table, sorting the combined data in memory, and then extracting the requested page. The deeper the page, the more rows each shard must return and the worse this performs.
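The middleware's merge step can be sketched like this. It is a simplified illustration of answering `ORDER BY pay_time LIMIT offset, size` across shards (the in-memory lists stand in for per-shard query results):

```python
# Sketch of cross-shard pagination: each shard must return its own first
# (offset + size) rows, which are then merge-sorted before the real
# offset is applied. Deeper pages therefore pull more rows per shard.
import heapq

def paged_query(shards, offset, size, key=lambda r: r["pay_time"]):
    # Each shard contributes its own top (offset + size) rows...
    partials = [sorted(s, key=key)[: offset + size] for s in shards]
    # ...which are merge-sorted centrally, then the page is cut out.
    merged = list(heapq.merge(*partials, key=key))
    return merged[offset: offset + size]

shard_a = [{"order": 1, "pay_time": 5}, {"order": 3, "pay_time": 1}]
shard_b = [{"order": 2, "pay_time": 3}, {"order": 4, "pay_time": 2}]
print(paged_query([shard_a, shard_b], offset=1, size=2))
```

Note that page 1000 of size 20 forces every shard to return 20,020 rows, which is exactly why deep pagination across shards gets expensive.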

Generally, we recommend using two sets of data to solve the problem of cross-node pagination queries. One is based on the single or multiple query data of the sharded order database, and the other is based on order data stored in Elasticsearch or Solr, primarily for pagination queries by operators based on other fields. To avoid affecting the business performance of submitting orders, we usually use asynchronous messaging to implement the addition and modification of order data in Elasticsearch or Solr.

4. Global Primary Key ID Issues #

After database and table sharding, database auto-increment can no longer guarantee unique primary keys, so we need to design a separate global primary key ID to avoid duplicate keys across different tables and databases.

Using a UUID as the global ID is the quickest option: it is a randomly generated 128-bit value, usually rendered as 32 hexadecimal digits, and it offers uniqueness, horizontal scalability, and high generation performance. Its biggest drawback is that it is a long string with no ordering; used as an InnoDB primary key, it causes random page writes and relatively poor performance.

We can also use Redis to generate incrementing primary key IDs, for example with its atomic INCR command or a counter guarded by a distributed lock. This keeps the primary key an integer with some continuity, but every ID now costs a network round trip to Redis, which carries a performance cost.

We can also solve the global primary key problem with snowflake, a distributed ID generation algorithm open-sourced by Twitter. Snowflake composes a long-typed ID from a timestamp, a machine identifier, and a per-millisecond sequence number; it can generate tens of thousands of globally unique IDs per second with good performance and low latency.
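A simplified version of the algorithm looks like this. It is a sketch of the classic bit layout (41-bit timestamp, 10-bit machine ID, 12-bit sequence), not a production-grade generator; it ignores clock-rollback handling, for instance:

```python
# Simplified snowflake sketch: 41-bit timestamp | 10-bit machine | 12-bit sequence.
# Not production-grade (no clock-rollback handling).
import threading
import time

class Snowflake:
    EPOCH = 1288834974657  # Twitter's custom epoch, in milliseconds

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 4096 IDs per ms
                if self.sequence == 0:           # sequence exhausted:
                    while now <= self.last_ms:   # spin until the next millisecond
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence

gen = Snowflake(machine_id=1)
ids = [gen.next_id() for _ in range(5)]
print(ids == sorted(ids))  # True: IDs are monotonically increasing
```

Because the high bits come from the timestamp, the generated IDs are roughly time-ordered, which keeps InnoDB primary-key inserts mostly sequential, unlike UUIDs.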

5. Scaling Issues #

As the number of user orders increases, the data volume in the sharded tables based on hashing the user ID also gradually accumulates. At this point, we need to consider dynamically adding tables, which involves data migration.

When designing the initial table count, it is advisable to make it a power of two, and to double it each time we scale out. With hash-modulo routing this minimizes data migration, since each key either stays in its current table or moves to exactly one new one.
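A small experiment makes this concrete. With modulo routing, a key `k` satisfies either `k % 2n == k % n` (it stays) or `k % 2n == k % n + n` (it moves to one new table), so doubling migrates only about half the rows, while a non-doubling change reshuffles most of them:

```python
# Why scale table counts in powers of two: with hash-modulo routing,
# doubling from n to 2n tables moves only ~half the keys, whereas an
# arbitrary change (e.g. 4 -> 6) reshuffles most of them.

def migrated_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard table changes after rescaling."""
    moved = sum(1 for k in keys if k % old_n != k % new_n)
    return moved / len(keys)

keys = range(100_000)
print(migrated_fraction(keys, 4, 8))  # 0.5: exactly half the keys move
print(migrated_fraction(keys, 4, 6))  # ~0.67: most keys move
```

Consistent hashing is another way to bound migration during scaling, but for fixed-rule sharding middleware, power-of-two doubling is the simpler convention.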

Summary #

Before starting business development, we should design tables around our business needs. Given the slow pace of an early-stage business and short development cycles, we should avoid database and table sharding when time is tight. However, we can reserve business interfaces for future sharding, decide the sharding rules in advance, and proactively add redundant fields so that cross-table and cross-database JOIN queries can be avoided when sharding eventually happens.

When the business develops rapidly, we need to assess the necessity of table and database sharding. Once sharding is needed, we should plan the sharding rules in advance based on the business, and try to avoid performance-consuming operations such as JOIN queries across tables and databases, pagination queries, and cross-database transactions.

Thinking Questions #

Which sharding middleware have you used? Feel free to share their implementation principles and pros and cons.