23 Distributed Database Table Structure Design - How to Properly Shard Data #

In the previous lecture (Lecture 22), we took a first look at distributed database architecture and saw that every kind of distributed database is built from three layers: a computation layer, a storage layer, and a metadata layer.

It is also important to note that a distributed database stores its data in individual shards. In a MySQL-based distributed database architecture, each shard lives inside a MySQL instance.

In today’s lecture, we will learn about a very important design in distributed databases: correctly sharding data to fully leverage the advantages of distributed database architecture.

Selecting a Shard Key #

When sharding a table, the first step is to select a shard key: the column (or columns) by which the data is split horizontally.

Take the orders table from the e-commerce business we used earlier; its structure is as follows:

CREATE TABLE `orders`
(
    `O_ORDERKEY` INT NOT NULL,
    `O_CUSTKEY` INT NOT NULL,
    `O_ORDERSTATUS` CHAR(1) NOT NULL,
    `O_TOTALPRICE` DECIMAL(15,2) NOT NULL,
    `O_ORDERDATE` DATE NOT NULL,
    `O_ORDERPRIORITY` CHAR(15) NOT NULL,
    `O_CLERK` CHAR(15) NOT NULL,
    `O_SHIPPRIORITY` INT NOT NULL,
    `O_COMMENT` VARCHAR(79) NOT NULL,
    PRIMARY KEY (`O_ORDERKEY`),
    KEY `idx_custkey_orderdate` (`O_CUSTKEY`,`O_ORDERDATE`),
    KEY `ORDERS_FK1` (`O_CUSTKEY`),
    KEY `idx_custkey_orderdate_totalprice` (`O_CUSTKEY`,`O_ORDERDATE`,`O_TOTALPRICE`),
    KEY `idx_orderdate` (`O_ORDERDATE`),
    KEY `idx_orderstatus` (`O_ORDERSTATUS`),
    CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`O_CUSTKEY`) REFERENCES `customer` (`C_CUSTKEY`)
) ENGINE=InnoDB;

For large-scale applications like Taobao, Jingdong, and Pinduoduo, a single MySQL database instance cannot meet the performance and storage capacity requirements of events like “Double 11” or “618.” In order to address this, the database needs to be transformed into a distributed database architecture.

The first step in this transformation is to select a shard key for the table and design the distributed architecture.

For the orders table mentioned above, possible shard keys include o_orderkey, o_orderdate, or o_custkey. After selecting the shard key, the next step is to choose a sharding algorithm. The most common options are RANGE and HASH algorithms.

For example, suppose we choose o_orderdate as the shard key for the orders table. By using the YEAR function to extract each order's year and applying the RANGE algorithm, we can design a distributed architecture based on RANGE sharding:

[Figure: RANGE sharding of the orders table by YEAR(o_orderdate), one shard per year]

As the diagram shows, with RANGE sharding the orders for 1992 are stored in shard 1, those for 1993 in shard 2, and those for 1994 in shard 3, and so on; to store a new year's data, we simply add a new shard.
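To make the rule concrete, here is a minimal SQL sketch of such RANGE routing; the shard labels and the ELSE branch are purely illustrative and not tied to any particular sharding middleware:

-- Illustrative RANGE routing: the order year decides the target shard.
SELECT o_orderkey,
       CASE
           WHEN YEAR(o_orderdate) = 1992 THEN 'shard 1'
           WHEN YEAR(o_orderdate) = 1993 THEN 'shard 2'
           WHEN YEAR(o_orderdate) = 1994 THEN 'shard 3'
           ELSE 'new shard'  -- a new year simply maps to a newly added shard
       END AS target_shard
FROM orders;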

However, the RANGE sharding algorithm is not ideal for a distributed database architecture, because it fails to deliver the two benefits the architecture is meant to provide (and the two pain points of a traditional single-instance database remain):

  • Scalable performance: adding shard nodes should improve performance roughly linearly;
  • Scalable storage capacity: adding shard nodes should remove the storage capacity bottleneck of a single node.

Sharding at a finer granularity with RANGE, say one shard per day, helps somewhat. But during events like “Double 11” or “618,” a single shard still ends up handling that day's entire workload, so the hot spot remains.

Therefore, the RANGE sharding algorithm is generally not an effective choice in a distributed architecture. It does have one advantage, though: it makes migrating data between machines easy. For example, to move the 1992 data to another instance, we only need to migrate that single shard's table as a whole.

For high-concurrency OLTP (Online Transaction Processing) businesses, the HASH sharding algorithm is generally recommended: every shard node serves traffic at the same time and the load is spread evenly across nodes, so performance and storage capacity both scale roughly linearly.

Let’s take a look at the orders table using HASH sharding based on o_orderkey. The sharding algorithm is as follows:

[Figure: HASH sharding rule for the orders table: the target shard is determined by o_orderkey modulo 4]

In this sharding rule, o_orderkey is the shard key and the total number of shards is 4 (that is, the original data is split across 4 separate tables). The shard for each row is determined by taking o_orderkey modulo 4.

With this approach, the resulting distributed design after applying the HASH sharding algorithm to the orders table is as shown in the following diagram:

[Figure: the orders table distributed across 4 shards by HASH(o_orderkey)]

As we can see, rows whose o_orderkey leaves a remainder of 0 when divided by 4 go to shard 1, a remainder of 1 to shard 2, a remainder of 2 to shard 3, and so on.
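Expressed as a minimal SQL sketch (the shard numbering simply follows the diagram above):

-- MOD-based HASH routing: the remainder of o_orderkey divided by 4
-- (0, 1, 2, or 3) selects one of the 4 shards.
SELECT o_orderkey,
       MOD(o_orderkey, 4) AS shard_remainder
FROM orders;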

This kind of sharding design based on the HASH algorithm is more suitable for large-scale internet businesses and truly meets the requirements of elastic scalability in distributed database architecture.

However, is selecting o_orderkey as the shard key the best choice for the orders table? Not necessarily.

Now consider the other tables in the database, such as customer and lineitem. These three tables (customer, orders, and lineitem) are frequently queried together, for example when fetching a customer's most recent order details.

If orders is sharded by o_orderkey, then lineitem can be sharded by l_orderkey. But the customer table contains no order-related column at all, so it cannot be sharded by an order key.

And if the customer table were sharded by some other field, the business data could not be unitized: the related rows of customer, orders, and lineitem would no longer land on the same shard, so customer-centric queries would have to cross shards.

Therefore, to achieve unitization of the sharded data, the best choice is to shard all three tables by the customer key: c_custkey for the customer table, o_custkey for the orders table, and l_custkey for the lineitem table:

[Figure: customer, orders, and lineitem sharded by c_custkey, o_custkey, and l_custkey respectively]

The advantage of this approach is that when querying based on user dimensions, all operations can be completed on a single shard without involving cross-shard access. For example, consider the following SQL query:

SELECT * FROM orders
INNER JOIN lineitem ON o_orderkey = l_orderkey
INNER JOIN customer ON o_custkey = c_custkey
WHERE o_custkey = 1
ORDER BY o_orderdate DESC LIMIT 10;
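Because all three tables route by the same customer key, the shard for each of them is computed with the same rule. The sketch below assumes the illustrative MOD-4 rule from earlier and that lineitem carries an l_custkey column, as in the design above:

-- All three lookups for customer 1 resolve to the same shard,
-- so the join above never has to leave that shard.
SELECT MOD(c_custkey, 4) AS customer_shard FROM customer WHERE c_custkey = 1;
SELECT MOD(o_custkey, 4) AS orders_shard   FROM orders   WHERE o_custkey = 1;
SELECT MOD(l_custkey, 4) AS lineitem_shard FROM lineitem WHERE l_custkey = 1;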

So the guiding principle of distributed database architecture design is: choose a suitable shard key and sharding algorithm to spread the data, and make sure the vast majority of business queries filter on that shard key.

Why are internet businesses such a natural fit for distributed architecture? Because most of them are consumer-facing (To C) businesses: the shard key is the user ID, and most access is keyed on that ID, for example:

  • viewing a given user's microblog posts or short videos;
  • viewing a given user's product listings or purchase records;
  • viewing a given user's account balance.

Having learned how to select a shard key, the next step is to plan the shards themselves, which is what we usually call splitting databases and tables.

Sharding #

We have been talking about shards for a while now, but what exactly is a shard? The shards discussed so far are essentially tables, not database instances. Strictly speaking:

Shard = Instance + Database + Table = ip@port:db_name:table_name
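For example (the address below is purely hypothetical), the orders table in a database named tpch on one MySQL instance would be the shard 192.168.0.1@3306:tpch:orders, that is, the table orders in database tpch on the instance listening at 192.168.0.1:3306.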

For the orders table above, assuming HASH sharding, the databases and tables can be laid out in any of the following ways:

  1. The table name and database name for each shard are the same, such as database tpch, table orders;
  2. The database name is different for each shard, but the table name is the same, such as database names tpch01, tpch02, tpch03, tpch04, and table name orders;
  3. The table name is different for each shard, but the database name is the same, such as database name tpch, and table names orders01, orders02, orders03, orders04 respectively for each shard;
  4. Both the database name and table name are different for each shard, such as the table for shard 1 is in the database tpch01 with the table name orders01; the table for shard 2 is in the database tpch02 with the table name orders02; the table for shard 3 is in the database tpch03 with the table name orders03; the table for shard 4 is in the database tpch04 with the table name orders04.

Of these four layout rules, the fourth is the most recommended, and it is what we usually mean by splitting databases and tables. Its advantages are as follows:

  • Because every shard has a unique database name and table name, data from different shards can coexist on the same MySQL instance, which makes capacity planning and future scaling straightforward;
  • Tables that share the same shard key live in the same shard database, so a whole shard can be migrated or expanded by moving that database as a unit.
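A minimal DDL sketch of the fourth rule, assuming the original orders table shown earlier lives in a database named tpch (CREATE TABLE ... LIKE copies its structure):

-- One database and one table per shard, following the
-- tpch01..tpch04 / orders01..orders04 naming convention.
CREATE DATABASE tpch01;
CREATE TABLE tpch01.orders01 LIKE tpch.orders;   -- shard 1
CREATE DATABASE tpch02;
CREATE TABLE tpch02.orders02 LIKE tpch.orders;   -- shard 2
CREATE DATABASE tpch03;
CREATE TABLE tpch03.orders03 LIKE tpch.orders;   -- shard 3
CREATE DATABASE tpch04;
CREATE TABLE tpch04.orders04 LIKE tpch.orders;   -- shard 4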

With the fourth rule, the architecture of a distributed MySQL database can look like this:

[Figure: the 4 shards, tpch01.orders01 through tpch04.orders04, all hosted on a single MySQL instance]

Did you notice that, with this design, even after the data has been sharded, all the databases and tables still live on the same MySQL instance? Remember: a distributed database does not necessarily require many instances. The fundamental requirement is that the data is sharded; after that, the database can be scaled out or back in as the business requires, achieving elasticity in both performance and capacity. That is the real appeal of a distributed database.

In the architecture above, the data of all 4 shards initially lives in a single MySQL instance. When a large promotion is coming, we can scale up, for example spreading the 4 shards across 4 MySQL instances:

[Figure: the 4 shards scaled out onto 4 separate MySQL instances]

Once the promotion is over, we can reclaim the resources and consolidate all the shards back onto a single MySQL instance; this is scaling back down.

In general, scaling a distributed database up and down is a routine operation at internet companies. Alibaba, for example, starts capacity assessment for the Double 11 event in the second half of every year and plans its database scale-up based on the results.

And since Double 11 is usually followed by Double 12, New Year, and Spring Festival promotions, the scaled-up capacity is typically kept until after Chinese New Year. Next, let's look at how to scale the database up and down.

Scaling Up and Down #

In the HASH sharding example we split the data into 4 shards. In a production environment, however, it is recommended to plan for no fewer than 1,000 shards from the start, so that future scaling operations stay simple.

Do not worry about having too many shards: managing 1 shard takes the same effort as managing 1,000. But 1,000 shards means the database can eventually be scaled out to as many as 1,000 instances, which is enough for most businesses (as an aside, a core Alibaba business is rumored to run its distributed database with 10,000 shards).

If 1,000 shards still turn out to be insufficient, can we re-split into 2,000? In theory yes, but it means logically re-splitting the data inside existing tables, which is a very complex undertaking and is usually not recommended.

So design a sufficient number of shards from the start. In my own work I have seen many businesses re-split from 32 or 64 shards to 256 or 512, and every time it was an enormous, thankless effort. Getting the distributed design right at the beginning is therefore of utmost importance!

So how do we actually scale up a MySQL database? Essentially, we build a replication topology and use replication filters so that each new replica replays only the databases holding the shards it will take over. The configuration on the slave server for such a shard looks roughly like this:

# Shard 1 slave server configuration
replicate_do_db = "tpch01"
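If the replica is already running, the same filter can also be set online instead of editing the configuration file. The sketch below assumes MySQL 5.7 or later; on the newest versions the commands are spelled STOP REPLICA / START REPLICA, and replication must be stopped while the filter is changed:

-- Configure the replica for shard 1 to replay only the tpch01 database.
STOP SLAVE;
CHANGE REPLICATION FILTER REPLICATE_DO_DB = (tpch01);
START SLAVE;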

So the first step of scaling up is to configure replication filters for the shards being moved, as shown in the figure below:

[Figure: replication filters configured so that each new instance replays only its own shard databases]

Then, during an off-peak period, switch the business traffic over to the new shard instances to complete the scale-up:

[Figure: business traffic switched over to the new shard instances]

As for scaling down, it is essentially the reverse of scaling up, so I won't go into it here.

Summary #

In today's lesson, we learned about shard design in distributed database architecture, commonly referred to as splitting databases and tables. I hope that through this lesson you have firmly grasped the following points:

  • In the sharding of distributed databases, you need to first select one or more fields as sharding keys.
  • The requirement for a sharding key is that it is frequently accessed by the business, and most of the tables in the business can be unitized based on this sharding key.
  • If you cannot find a suitable sharding key, it is not possible to transform the business into a distributed database.
  • After selecting the sharding key, you need to choose the sharding algorithm, typically either the RANGE or HASH algorithm.
  • For massive OLTP businesses, it is recommended to use the HASH algorithm and strongly discouraged to use the RANGE algorithm.
  • Once the sharding key and sharding algorithm are selected, design the database and table layout; using a different database name and table name for every shard is recommended, so the sharded data can easily be scaled up or down later.
  • During actual scaling up, you can use replication filtering to only replicate the necessary shard data.

Today’s content is very practical, and I hope you will read it repeatedly to master the most basic and important knowledge points in the design of a distributed database architecture. See you in the next lesson.