03 Data Sharding - How to Store Massive Datasets #

In the previous two lectures, we introduced distributed databases and the development of various SQL databases. Starting from this lecture, we will officially dive into the core principles of distributed databases.

With the advent of the internet era, especially mobile internet, various enterprises are rapidly upgrading and iterating their system platforms as part of their transformation to the internet.

In this context, the database systems behind these application platforms must absorb a sudden influx of massive transactional data, and a single monolithic database is easily overwhelmed. The most common technique used to scale a database is “data sharding”.

Therefore, in this lecture, I will introduce what sharding is and how it can be used to scale databases. I will also review the advantages and disadvantages of common sharding architectures, using TiDB as an example, and discuss how sharding can be implemented in distributed databases.

Introduction to Data Sharding #

Sharding is the process of partitioning a large table into smaller tables (called shards) that are distributed across the nodes of a database cluster. It is a horizontal scaling technique, conceptually similar to table partitioning in a traditional database, except that the partitions live on different nodes. Each shard holds a subset of the original dataset, so the total workload is spread across the shards.

There are generally two ways to shard data:

  1. Horizontal sharding: Storing different rows of the same table in different database nodes.
  2. Vertical sharding: Storing different columns of a table in different database nodes.

As shown in the following figure, the terms horizontal and vertical sharding come from how a relational table is drawn: horizontal sharding cuts the table by rows, while vertical sharding cuts it by columns.

Figure 1: Visual representation of horizontal and vertical sharding

The idea behind sharding echoes the theory of marginal utility in economics: as investment keeps increasing, the additional return per unit of investment eventually starts to shrink. This is the state of diminishing marginal returns, and the point where the decline begins is called the marginal balance point.

The same theory applies to the processing capacity of a database and is often described as follows: when a database hits a performance bottleneck, the simplest remedy is to keep upgrading the machine, for example with a more powerful CPU or more memory. This mode is called vertical scaling. However, vertical scaling has its limits: as resources keep being added, the database eventually reaches the marginal balance point, beyond which each additional unit of hardware buys less and less extra capacity.

At this point, horizontally sharding the table brings additional machines, and therefore additional computing power, into processing data and transactions, turning diminishing returns back into increasing returns. By continuously balancing the load and data volume across all nodes, sharding can achieve a 1+1>2 effect, where the cluster as a whole delivers more processing capacity than its nodes would deliver individually.

This makes horizontally scalable clusters built from smaller, cheaper servers potentially more cost-effective than maintaining a single large commercial database server. This is also the core technical background of the “IOE reduction” movement mentioned in the first lecture.

Besides solving the scalability problem, sharding also mitigates unplanned downtime and can greatly reduce the system’s RTO (Recovery Time Objective). Without sharding, even planned downtime takes the entire database offline, which fails to meet the business’s SLOs (Service Level Objectives).

If sharding works as expected, it helps keep the system highly available: even if some nodes in the database cluster fail, the database as a whole can still serve requests as long as the remaining nodes are running. Of course, this also depends on replication and consistency guarantees, which we will discuss in later lessons.

In summary, sharding can increase the total capacity and speed of a database cluster, while providing higher availability at a lower cost compared to vertical scaling.

Sharding Algorithms #

Sharding algorithms generally refer to the algorithms used for horizontal sharding. After years of evolution, they have been widely adopted in large systems. I will now introduce the two most common horizontal sharding algorithms and briefly discuss optimization ideas behind some other sharding schemes.

Hash Sharding #

In hash sharding, we take the shard key, compute its hash with a fixed hashing algorithm, and use the resulting hash value to decide which shard the row belongs to. A database usually applies one uniform hashing algorithm (such as ketama) to all data so that the hash function spreads rows evenly across servers and reduces the risk of data skew. Because hashing scatters adjacent keys, related rows are unlikely to pile up on one shard, and the data ends up distributed essentially at random.

This algorithm suits workloads with random reads and writes because it spreads the load evenly across the system, but it handles range scans poorly. The figure below illustrates how it works.

Figure 2: Sharding by Hash
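
To make the idea concrete, here is a minimal hash-sharding sketch in Java, assuming a fixed cluster of shardCount nodes and using CRC32 purely for simplicity (real systems typically prefer a stronger or consistent hash such as murmur3 or ketama):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashSharding {

    private final int shardCount;

    public HashSharding(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Maps a shard key to a shard index in [0, shardCount). */
    public int shardFor(String shardKey) {
        // CRC32 is used only for illustration; production systems usually
        // prefer murmur3 or a consistent-hashing scheme such as ketama.
        CRC32 crc = new CRC32();
        crc.update(shardKey.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount);
    }

    public static void main(String[] args) {
        HashSharding sharding = new HashSharding(4);
        // Adjacent keys usually land on different shards, which spreads load
        // but makes range scans expensive.
        System.out.println(sharding.shardFor("user-10001"));
        System.out.println(sharding.shardFor("user-10002"));
    }
}
```

Because the shard is derived from the hash of the key, a range query such as “all users with IDs between 10000 and 20000” has to be sent to every shard.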

Range Sharding #

Range sharding partitions data according to ranges of the key space, so adjacent shard keys are likely to land in the same shard. Unlike hash sharding, no transformation is applied to the key; each row is simply placed in the shard whose range covers its key. The diagram below illustrates how range sharding works.

Figure 3: Range Sharding

Range sharding requires choosing the shard key carefully. A good shard key has high cardinality, that is, few duplicate values, so that its values can be spread across many shards. The key should also not be monotonically increasing or decreasing; otherwise new writes keep landing at one end of the key space, the data cannot spread well across the cluster, and hotspots appear.

Range sharding is ideal for range-based queries, but it has weaker performance for random read and write operations.
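
Below is a minimal range-sharding sketch, assuming the shard boundaries are fixed up front; TreeMap.floorEntry finds the shard whose lower bound is the largest one not exceeding the key:

```java
import java.util.TreeMap;

public class RangeSharding {

    // Inclusive lower bound of each range -> shard name.
    // [0, 100) -> shard-0, [100, 200) -> shard-1, [200, +inf) -> shard-2.
    private final TreeMap<Long, String> lowerBounds = new TreeMap<>();

    public RangeSharding() {
        lowerBounds.put(0L, "shard-0");
        lowerBounds.put(100L, "shard-1");
        lowerBounds.put(200L, "shard-2");
    }

    /** Returns the shard responsible for the given key (keys assumed >= 0 here). */
    public String shardFor(long key) {
        return lowerBounds.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RangeSharding sharding = new RangeSharding();
        System.out.println(sharding.shardFor(42L));   // shard-0
        // Adjacent keys tend to fall into the same shard, which is why range
        // sharding handles range scans well but concentrates monotonically
        // increasing writes on a single shard.
        System.out.println(sharding.shardFor(150L));  // shard-1
        System.out.println(sharding.shardFor(151L));  // shard-1
    }
}
```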

Hybrid Algorithms #

We should recognize that the hash and range sharding algorithms described above are not mutually exclusive options. On the contrary, they can be flexibly combined.

For example, we can establish a multi-level sharding strategy where the hash algorithm is used at the top level, and within each hash-based shard unit, the data is stored in a sequential order.
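
A rough sketch of such a two-level layout is shown below (all names are illustrative): a hash picks the shard, and an ordered map inside each shard keeps keys sorted so that short range scans within one shard remain cheap:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class HybridSharding {

    private final int shardCount;
    // Top level: hash -> shard index. Second level: keys kept in sorted order.
    private final Map<Integer, TreeMap<String, String>> shards = new HashMap<>();

    public HybridSharding(int shardCount) {
        this.shardCount = shardCount;
    }

    public void put(String key, String value) {
        int shard = Math.floorMod(key.hashCode(), shardCount);
        shards.computeIfAbsent(shard, s -> new TreeMap<>()).put(key, value);
    }

    /** Ordered scan restricted to one shard; cross-shard scans need scatter-gather. */
    public Map<String, String> scanWithinShard(String fromKey, String toKey) {
        int shard = Math.floorMod(fromKey.hashCode(), shardCount);
        TreeMap<String, String> data = shards.getOrDefault(shard, new TreeMap<>());
        return data.subMap(fromKey, true, toKey, false);
    }
}
```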

This algorithm is relatively simple and flexible. Now let’s talk about a geolocation algorithm.

Geolocation Algorithm #

This algorithm is commonly used in NewSQL databases to enable data distribution on a global scale.

In the geolocation-based sharding algorithm, data is mapped to specific shards, and these shards are mapped to specific regions and nodes within those regions.

Then, within a given region, data is further sharded using hash or range sharding. For example, a cluster spanning three regions, the United States, China, and Japan, can use the Country_Code column of the User table to map each user’s rows to the region that satisfies the proximity rules.
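
Below is a minimal sketch of the first-level, geography-based routing step, assuming a hypothetical Country_Code column and three deployment regions; the second-level hash or range sharding inside each region is omitted:

```java
import java.util.HashMap;
import java.util.Map;

public class GeoSharding {

    // Hypothetical mapping from Country_Code values to deployment regions.
    private static final Map<String, String> COUNTRY_TO_REGION = new HashMap<>();
    static {
        COUNTRY_TO_REGION.put("US", "us-region");
        COUNTRY_TO_REGION.put("CN", "cn-region");
        COUNTRY_TO_REGION.put("JP", "jp-region");
    }

    /** First-level routing: place the user's rows in the nearest region. */
    public static String regionFor(String countryCode) {
        // Countries without a dedicated deployment fall back to a default region.
        return COUNTRY_TO_REGION.getOrDefault(countryCode, "us-region");
    }

    public static void main(String[] args) {
        // A row with Country_Code = "JP" lands in the Japan region,
        // then is sharded again inside that region by hash or range.
        System.out.println(regionFor("JP")); // jp-region
    }
}
```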

These are some typical sharding algorithms. Now let’s continue the discussion on how to apply sharding algorithms to practical scenarios.

Manual Sharding vs Automatic Sharding #

As the name suggests, manual sharding means distributing data to database nodes according to static, hand-written rules that implement the sharding algorithm. It is usually done when the underlying database does not support automatic sharding, such as MySQL or Oracle. In that case, sharding is implemented in the application layer, or with simple database middleware or proxies configured with static sharding rules.

The main disadvantage of manual sharding is uneven data distribution. Skewed data can lead to severely imbalanced database load, with some nodes overloaded while others receive little traffic.

Therefore, it is best to avoid concentrating too much data on particular nodes, because those nodes become hotspots, slow down, or even crash. The opposite problem also exists: when the overall dataset is very small, only a few nodes in the cluster hold any data at all.

This situation may be acceptable in development and testing environments, but it is unacceptable in production environments. Uneven data distribution, hotspots, and storing data on too few shards can deplete the computing resources of database cluster nodes, leading to an unstable system.

However, with careful design and minimal changes in data distribution, manual sharding can be a relatively simple and cost-effective solution.

On the other hand, using automatic sharding means that the compute nodes and sharding algorithm can work together, enabling the database to scale elastically.

Range-based sharding makes it easy to achieve automatic sharding: just split or merge each shard.

For example, suppose we have a shard with a range of [1, 100), and we want to split it into two ranges. We can select 50 as the splitting point and divide the range into [1, 50) and [50, 100). Afterwards, we can move these two ranges to different database nodes to achieve a balanced system load.
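
Here is a minimal sketch of that split step, assuming half-open ranges [start, end) and a split point chosen by the system (records require Java 16 or later):

```java
public class RangeSplit {

    /** A half-open key range [start, end) owned by one shard. */
    record Range(long start, long end) {

        /** Splits this range at the given point into two adjacent sub-ranges. */
        Range[] splitAt(long point) {
            if (point <= start || point >= end) {
                throw new IllegalArgumentException("split point must lie strictly inside the range");
            }
            return new Range[] { new Range(start, point), new Range(point, end) };
        }
    }

    public static void main(String[] args) {
        Range original = new Range(1, 100);
        Range[] halves = original.splitAt(50);
        // [1, 50) and [50, 100) can now be placed on different nodes
        // to rebalance the load.
        System.out.println(halves[0]); // Range[start=1, end=50]
        System.out.println(halves[1]); // Range[start=50, end=100]
    }
}
```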

Range-based sharding can produce read and write hotspots, but these hotspots can be eliminated by splitting and moving shards.

On the other hand, implementing automatic sharding on top of hash-based sharding is very costly. Let’s illustrate with the hash sharding example shown earlier in Figure 2.

The current system has 4 nodes, and a new database node is then added. In the hash function, “n” changes from 4 to 5, which leads to significant instability in data placement. Although a consistent hashing algorithm such as ketama can minimize this instability, data migration and rebalancing operations are still necessary.

This is because after applying the hash function, the data is randomly distributed, and adjusting the hashing algorithm will definitely change the distribution of most data.
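
The following small experiment sketch illustrates the cost: with naive modulo placement, changing the node count from 4 to 5 relocates most keys (the exact fraction depends on the key set):

```java
public class ReshardingCost {

    public static void main(String[] args) {
        int totalKeys = 100_000;
        int moved = 0;
        for (int key = 0; key < totalKeys; key++) {
            int before = key % 4; // shard index with n = 4 nodes
            int after  = key % 5; // shard index after adding a 5th node
            if (before != after) {
                moved++;
            }
        }
        // With plain modulo placement roughly 80% of the keys change shards;
        // consistent hashing (e.g. ketama) shrinks this to roughly 1/n of the
        // keys, but some migration is still unavoidable.
        System.out.printf("moved %d of %d keys (%.1f%%)%n",
                moved, totalKeys, 100.0 * moved / totalKeys);
    }
}
```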

Automatic sharding has become a mainstream feature of distributed databases, and all major distributed databases, and even database middleware, are moving toward it. I will illustrate this with several case studies.

Sharding Algorithm Examples #

Data sharding is a core feature of database middleware, and there are many open-source projects in this field. I will use the sharding features of Apache ShardingSphere as an example to walk through practical applications of the sharding algorithms.

Shard Key Generation #

ShardingSphere first provides distributed primary key generation, which is crucial for generating shard keys. Since a distributed database spans multiple database nodes, primary key generation tied to a single database instance, such as an auto-increment column, is not suitable for distributed scenarios.

The commonly used algorithms are UUID and Snowflake, two stateless generation algorithms.

UUID is the simplest method, but its generation efficiency is not high and its data dispersion is mediocre, so Snowflake is the algorithm typically used in production environments. The following diagram shows the structure of a shard key generated with Snowflake.

Figure 4: Shard Key Structure

The ID consists of three meaningful parts (a simplified sketch follows the list):

  1. Timestamp: Similar to UNIX time, it records the number of milliseconds from a custom starting point to the current moment. At this width, the algorithm can be used for nearly 70 years.
  2. Worker Node ID: Ensures that independent worker database nodes do not generate duplicate IDs.
  3. Access Sequence: Ensures that IDs generated within the same process and the same millisecond do not collide.
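
Below is a simplified Snowflake-style sketch of how such an ID can be composed, assuming the common layout of a 41-bit timestamp, a 10-bit worker node ID, and a 12-bit sequence (the exact bit widths, the custom epoch, and clock-rollback handling vary between implementations and are simplified here):

```java
public class SnowflakeSketch {

    // Assumed custom epoch (2020-01-01 UTC); 41 bits of milliseconds last about 69 years.
    private static final long EPOCH = 1577836800000L;

    private final long workerId;      // 10 bits -> up to 1024 worker nodes
    private long lastTimestamp = -1L;
    private long sequence = 0L;       // 12 bits -> up to 4096 IDs per node per millisecond

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId & 0x3FF;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            // Same millisecond: advance the sequence. A real implementation waits
            // for the next millisecond when the 12-bit sequence overflows.
            sequence = (sequence + 1) & 0xFFF;
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        // Layout: | 41-bit timestamp | 10-bit worker id | 12-bit sequence |
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}
```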

Flexible Sharding Algorithm #

To keep shard calculation flexible, ShardingSphere provides standard sharding algorithms as well as tools that help users implement personalized algorithms.

  1. Using PreciseShardingAlgorithm together with a hash function, hash sharding can be implemented (a minimal example is sketched after this list). RangeShardingAlgorithm can be used to implement range sharding.
  2. Using ComplexShardingStrategy, multiple shard keys can be combined to implement composite sharding algorithms.
  3. Sometimes the sharding patterns of data tables are not completely consistent. For such special cases, HintShardingStrategy can be used to specify routing rules at runtime without relying on the unified sharding configuration.
  4. If users want to implement special sharding algorithms, such as a geolocation algorithm, they can customize the sharding strategy. It can be written as an inline expression or as Java code. The former is configuration-based and requires no compilation, making it suitable for simple personalized shard calculations; the latter can express more complex calculations but must be compiled and packaged.

Through the above multiple shard tools, users can flexibly and uniformly specify their database sharding strategies.
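
For instance, a minimal custom hash-sharding implementation might look like the sketch below, written against the PreciseShardingAlgorithm interface of the ShardingSphere 4.x API (the package names and the target-table naming convention are assumptions for this sketch and differ in 5.x releases):

```java
import java.util.Collection;

import org.apache.shardingsphere.api.sharding.standard.PreciseShardingAlgorithm;
import org.apache.shardingsphere.api.sharding.standard.PreciseShardingValue;

/** Routes each row to a target (e.g. t_order_0, t_order_1, ...) by hashing the shard key. */
public class HashModShardingAlgorithm implements PreciseShardingAlgorithm<Long> {

    @Override
    public String doSharding(Collection<String> availableTargetNames,
                             PreciseShardingValue<Long> shardingValue) {
        long index = Math.floorMod(shardingValue.getValue(), (long) availableTargetNames.size());
        for (String target : availableTargetNames) {
            // Assumes target names end with the shard index, e.g. t_order_0.
            if (target.endsWith("_" + index)) {
                return target;
            }
        }
        throw new UnsupportedOperationException(
                "no shard found for value " + shardingValue.getValue());
    }
}
```

In the 4.x API this interface handles equality-based routing on the shard key, while RangeShardingAlgorithm covers range conditions.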

Automatic Sharding #

ShardingSphere provides Sharding-Scale to support elastic scaling of database nodes; this is its implementation of automatic sharding. The following diagram demonstrates the feature: after scaling with Sharding-Scale, the original two databases have been expanded to three.

Figure 5: Automatic Sharding Feature Demonstration

The automatic sharding feature consists of the four processes shown in the diagram below.

Figure 6: Automatic Sharding Process

As Figure 6 shows, with this capability ShardingSphere can support complex hash-based automatic sharding. At the same time, it also shows that without professional, automated elastic scaling tools, automatic sharding is very difficult to achieve.

These are practical examples of sharding algorithms, using the classic horizontal sharding pattern. Currently, there is a trend towards further merging horizontal and vertical sharding. Next, we will introduce TiDB, which represents this fusion trend.

Fusion of Vertical and Horizontal Sharding #

TiDB is a typical example of the fusion of vertical and horizontal sharding, and it is also an HTAP (Hybrid Transactional/Analytical Processing) solution built on that fusion.

The horizontal scaling relies on the underlying TiKV, as shown in the diagram below.

Figure 7: TiKV

TiKV uses range sharding, where data is assigned to Regions. Each Region group maintains three replicas, which ensures high availability (more details will be explained in “05 | Consistency and CAP Theorem: Why Do We Need Distributed Consistency?”). When a Region grows too large, it is split, and the newly split Regions again maintain multiple replicas.

The vertical sharding of TiDB, its columnar side, relies on TiFlash, as shown in the diagram below.

Figure 8: TiFlash

From Figure 8, we can see that TiFlash is a columnar storage extension of TiKV. Data is asynchronously replicated from TiKV to TiFlash and converted into columnar form, with MVCC used to ensure data consistency.

Each Region described above gains an additional asynchronous replica; that replica is transformed into columnar form and merged into TiFlash. This achieves the fusion of horizontal and vertical sharding within a single database, that is, the fusion of two storage engines.

The benefits of this fusion show up mainly in the query layer, especially in efficiently running aggregation queries over specific columns. TiDB can intelligently route queries between the two engines to achieve the best query efficiency.

Summary #

This concludes this lesson. We first explained the principles of sharding and the commonly used sharding algorithms in detail, and then analyzed the differences between manual sharding and automatic sharding; the future of data sharding lies in automatic sharding.

Finally, I showed how sharding techniques are applied in distributed databases through two well-known open-source projects. The HTAP approach demonstrated by TiDB, which combines the two sharding modes, can be seen as the direction in which sharding patterns are evolving.

Mutual Learning #

Here’s a question for you to think about after class.

Design a sharding algorithm that can scale out to new nodes without data migration for a certain period of time, while ensuring that hotspots are not generated.