10 How Does an ID Generator Ensure Global Uniqueness After Database Sharding #

Hello, I am Tang Yang.

In the previous two lessons, I introduced you to the two core issues of distributed storage: data redundancy and data sharding, as well as how they are solved in traditional relational databases. When we face high-concurrency data query requests, we can use master-slave read-write separation to deploy multiple slave databases to share the read pressure. When the amount of stored data reaches a bottleneck, we can shard the data storage across multiple nodes to reduce the storage pressure on each individual node. At this point, our architecture looks like this:

[Image: sharded architecture with master-slave read-write separation]

As you can see, we have solved the scalability problem of the database through sharding and master-slave read-write separation. However, in Lesson 09, I also mentioned that after sharding, there are many limitations when using the database. For example, queries must include the partition key; some aggregate queries (like count()) have poor performance and require considering other counter-based solutions. There is another problem with sharding that I didn’t mention in Lesson 09, which is the issue of global uniqueness of primary keys. In this lesson, I will guide you to understand how to generate globally unique database primary keys after sharding.

However, before exploring this issue, you first need to understand the question of which field to use as the primary key. This will lay the groundwork for exploring how to generate globally unique primary keys later.

How to Choose the Primary Key for a Database? #

Every record in a database needs a unique identifier. According to the second normal form, each table should have a primary key that uniquely identifies each row, with every non-key attribute fully dependent on that primary key.

So the choice of primary key becomes crucial. Generally speaking, you have two options:

  1. Use a business field as the primary key. For example, for a user table, you can use the mobile phone number, email, or ID card number as the primary key.

  2. Use a generated unique ID as the primary key.

However, the first option is not applicable in most scenarios. For a comment table, for example, it is hard to find any business field that uniquely identifies a comment. For a user table, we need to ask whether a candidate business field can uniquely identify a person: a person can have multiple emails and phone numbers, and changing an email or phone number would force us to update every foreign key that references it, so neither makes an appropriate primary key.

Although the ID card number is indeed a unique identifier for a person, privacy concerns mean it is usually not a required attribute in a user system. Just think about it: if your system does not require real-name authentication, it certainly will not require users to fill in their ID card numbers. Moreover, even issued ID card numbers can change; in 1999, for example, Chinese ID card numbers were extended from 15 digits to 18. Once a primary key changes, every table that references it as a foreign key must be updated as well, which is a huge amount of work.

Therefore, I am more inclined to use a generated ID as the primary key for the database. Not only because of its uniqueness, but also because once generated, it does not change and can be referenced freely.

In the scenario of a single database and single table, we can use the database’s auto-increment field as the ID because it is the simplest and transparent for developers. However, when the database is sharded into multiple databases or tables, using an auto-increment field can no longer guarantee the global uniqueness of the ID.

Just imagine, after sharding the database, the data of the same logical table is distributed across multiple databases. If we use the database’s auto-increment field as the primary key, it can only guarantee uniqueness within that database, not globally. So, when designing a user system, if you use an auto-increment ID as the user ID, it is possible to have two users with the same ID, which is unacceptable. In that case, I suggest setting up an ID generator service to generate globally unique IDs.

Building an ID generator based on the Snowflake algorithm #

In my experience with various projects over the years, I have primarily used a variant of the Snowflake algorithm to generate the IDs required for business purposes. The focus of this lesson is to use the algorithm to solve the problem of global ID uniqueness. You might ask, “Why don’t you mention UUID when it comes to global uniqueness?”

Indeed, UUID (Universally Unique Identifier) does not rely on any third-party system, making it good in terms of performance and availability. I usually use it to generate a Request ID to tag individual requests. However, if used as a database primary key, it has the following issues.

Firstly, generated IDs should be monotonically increasing, meaning they should be ordered, a property UUIDs lack. Why should IDs be ordered? Because IDs often serve as sorting fields in system design. Let me give you an example.

Suppose you want to implement a comment system, which generally consists of two tables. The comment table stores the detailed comment information, including an ID field, the comment content, the commenter’s ID, the ID of the commented content, and so on, with the ID field as the partition key. The comment list table stores the mapping between content ID and comment ID, with the content ID as the partition key.

When retrieving the comment list for a piece of content, we need to sort it in reverse chronological order. Since the IDs are time-ordered, we can simply sort by comment ID in descending order. If comment IDs were not time-ordered, we would need to store an additional creation-time column in the comment list table. Assuming the content ID, comment ID, and time each take 8 bytes, the extra time column would add 50% to the storage footprint of each row, which is wasteful.

Another reason is that ordered IDs can improve data write performance.

We know that the MySQL InnoDB storage engine uses a B+ tree to store index data, and the primary key is also a type of index. Index data is ordered in the B+ tree, as shown in the following example, where 2, 10, and 26 are the IDs of the records and represent index data.

[Image: B+ tree index containing the record IDs 2, 10, and 26]

Therefore, when the next record with a monotonically increasing ID is inserted, such as inserting 30, the database only needs to append it to the end. However, if the inserted data is unordered, for example, the ID is 13, the database needs to find the position where 13 should be inserted and move the data after 13 accordingly, resulting in unnecessary data movement overhead.
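The cost difference can be illustrated with a plain Python list standing in for a sorted index page (a deliberate simplification; real InnoDB pages and page splits are more involved):

```python
import bisect

# A sorted "index page" (simplified stand-in for a B+ tree leaf).
page = [2, 10, 26]

# Monotonically increasing ID: the insert position is the end,
# so no existing entries have to move.
page.append(30)          # page is now [2, 10, 26, 30]

# Out-of-order ID: we must first find the slot, then shift every
# element after it one position to the right.
bisect.insort(page, 13)  # page is now [2, 10, 13, 26, 30]

print(page)
```

The `append` is O(1), while the mid-page insert pays for both the search and the data movement, which is exactly the overhead ordered IDs avoid.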

[Image: inserting the out-of-order ID 13 forces existing index entries to move]

We know that mechanical hard disks need to perform “seeking” to find the position to write to when completing random writes, which involves moving the read/write head to the corresponding track. This process is time-consuming. On the other hand, sequential writes do not require seeking and can greatly improve the write performance of indexes.

Another reason UUIDs make poor primary keys is that they carry no business meaning. In practice, widely used IDs often embed meaningful data at fixed positions: the first six digits of a Chinese ID card number encode the area, digits 7 to 14 encode the birthdate, phone numbers carry city-specific area codes, and the first three digits of a mobile number identify the carrier. If a generated ID can be decoded, we can use the recovered information to validate it: when it was generated, which ID generator in which data center produced it, and which business it was serving, all of which helps with troubleshooting.

Finally, a UUID is typically stored as a 36-character string (32 hexadecimal digits plus hyphens), which consumes far more space as a database primary key than a 64-bit integer.

As you can see, the UUID approach has significant limitations, which is why I do not recommend it. The Snowflake algorithm proposed by Twitter, on the other hand, addresses these shortcomings. It is simple to implement, satisfies the global uniqueness and monotonicity that IDs require, and can carry some business meaning. The core idea behind Snowflake is to divide a 64-bit integer into several parts, each storing data with a specific meaning, such as a timestamp, a machine ID, and a sequence number, to generate globally unique, ordered IDs. The standard layout is as follows:

[Image: standard Snowflake ID layout: 1 sign bit, 41-bit timestamp, 10-bit machine ID, 12-bit sequence number]

From the image above, we can see that the 41-bit timestamp can represent pow(2,41)/1000/60/60/24/365 ≈ 69 years of milliseconds, which is sufficient for most systems.

If your system is deployed across multiple data centers, the 10-bit machine ID can be further split: 2 bits for the IDC (supporting 4 data centers) with 8 bits for the machine (256 machines per IDC), or 3 bits for the IDC (8 data centers) with 7 bits for the machine (128 machines per IDC). The 12-bit sequence number caps the number of IDs each node can generate per millisecond at 4,096.
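To make the layout concrete, here is a minimal Python sketch of a standard Snowflake generator. The epoch value and the spin-wait-on-overflow strategy are my assumptions, and clock-rollback handling (discussed later in this lesson) is deliberately omitted:

```python
import threading
import time

class Snowflake:
    """Standard layout: 1 sign bit + 41-bit ms timestamp + 10-bit machine ID + 12-bit sequence."""
    EPOCH = 1288834974657  # custom epoch in ms; an illustrative choice

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < (1 << 10)  # machine ID must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000)
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:           # 4,096 IDs used up this millisecond
                    while ts <= self.last_ts:    # spin until the next millisecond
                        ts = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ts = ts
            # timestamp occupies the high bits, so IDs are time-ordered
            return ((ts - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence

gen = Snowflake(machine_id=1)
a, b = gen.next_id(), gen.next_id()
assert a < b  # monotonically increasing
```

Because the timestamp sits in the highest bits, comparing two IDs numerically is equivalent to comparing their generation times.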

Different companies modify the Snowflake algorithm to fit the characteristics of their own business: for example, trading sequence-number bits for machine-ID bits to support more machines in a single IDC, or adding a business ID field to distinguish different businesses. The ID generator I currently use is composed of: 1 compatibility bit (always 0) + 41 bits of timestamp + 6 bits of IDC information (supporting 64 IDCs) + 6 bits of business information (supporting 64 businesses) + 10 bits of sequence (supporting 1,024 IDs per millisecond).

I chose this composition rule mainly because I deploy only one ID generator node in a single data center and use KeepAlive to ensure availability. The business information refers to which business module in the project uses the ID, such as the user module or the content module. By adding it, I hope that the IDs generated by different businesses can be different. Additionally, adding the business information allows for ID reverse lookup to determine which business generated the ID in case of issues.
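Given a composition rule like the one above, reverse lookup is simple bit arithmetic. The following sketch packs and unpacks IDs with that 1 + 41 + 6 + 6 + 10 layout; the function names and sample values are illustrative, not from any real system:

```python
# Variant layout from the text: 1 compatibility bit (always 0)
# + 41-bit timestamp + 6-bit IDC + 6-bit business + 10-bit sequence.
TS_BITS, IDC_BITS, BIZ_BITS, SEQ_BITS = 41, 6, 6, 10

def pack(ts_ms: int, idc: int, biz: int, seq: int) -> int:
    return ((ts_ms << (IDC_BITS + BIZ_BITS + SEQ_BITS))
            | (idc << (BIZ_BITS + SEQ_BITS))
            | (biz << SEQ_BITS)
            | seq)

def unpack(id_: int) -> dict:
    """Reverse lookup: recover each field from its fixed bit position."""
    return {
        "timestamp_ms": id_ >> (IDC_BITS + BIZ_BITS + SEQ_BITS),
        "idc": (id_ >> (BIZ_BITS + SEQ_BITS)) & ((1 << IDC_BITS) - 1),
        "business": (id_ >> SEQ_BITS) & ((1 << BIZ_BITS) - 1),
        "sequence": id_ & ((1 << SEQ_BITS) - 1),
    }

id_ = pack(ts_ms=1700000000000, idc=3, biz=12, seq=7)
print(unpack(id_))  # recovers idc=3, business=12 for troubleshooting
```

This is what makes the business and IDC fields useful in practice: given only a problematic ID from a log, you can tell which data center and which business module produced it.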

Now that we understand the principles of the Snowflake algorithm, how do we engineer it to generate globally unique IDs for our business? Generally, there are two common implementation methods:

One approach is to embed the algorithm in the business code, distributed among the business servers. The advantage of this approach is that the business code does not need to make cross-network calls when using the algorithm, which improves performance. However, it requires more machine ID bits to support more business servers. Furthermore, since there are many business servers, it is difficult to ensure the uniqueness of machine IDs, so a distributed consistency component like ZooKeeper is needed to ensure that each machine can obtain a unique machine ID upon restart.

The other deployment approach is to deploy the algorithm as a standalone service, commonly known as the ID generator service. Using the ID generator requires an additional network call, but the performance impact of internal network calls is limited. This approach can reduce the number of machine ID bits. If the ID generator is deployed in a master-backup configuration with only one generator running at a time, the machine ID can be omitted, leaving more bits for the incrementing information. Even if a machine ID is required, it can be written in the generator’s configuration file, ensuring the uniqueness of the machine ID without the need for third-party components. Weibo and Meituan both deploy their ID generators as standalone services, achieving a throughput of 20,000 IDs per second on a single instance with a single CPU.

The Snowflake algorithm is simple and efficient, generating globally unique, monotonically increasing IDs that can carry business meaning. However, it has one major drawback: its reliance on the system clock. If the clock moves backward (a clock rollback), duplicate IDs may be generated. Therefore, when the ID generator detects that the clock has moved backward, it should temporarily refuse to generate IDs until the clock catches up.

In addition, if the QPS (queries per second) of ID generation is low, say one ID per millisecond, the sequence part of every generated ID will be the same (for example, always ending in 1). If such IDs are used as the partition key, data will be distributed unevenly across databases and tables. This is a pitfall I hit in a real project, and there are two main mitigations:

  1. Record the timestamp in seconds rather than milliseconds, so that many IDs with different sequence numbers are generated within the same time interval, avoiding skew when sharding.

  2. Randomize the starting value of the sequence number in each interval, for example starting at 21 in one second and 30 in the next, to spread the low bits as evenly as possible.
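The second mitigation can be sketched like this (the randomization range of 0 to 99 is an illustrative choice, not a prescribed value):

```python
import random

SEQ_BITS = 12            # sequence width, as in standard Snowflake
current_second = None
sequence = 0

def next_sequence(now_s: int) -> int:
    """Per-interval sequence whose starting value is randomized each second,
    so low-QPS IDs do not all end with the same low bits."""
    global current_second, sequence
    if now_s != current_second:
        current_second = now_s
        # Start at a small random offset instead of 0
        # (e.g. 21 in one second, 30 in the next).
        sequence = random.randint(0, 99)
    else:
        sequence = (sequence + 1) % (1 << SEQ_BITS)
    return sequence

first = next_sequence(1)   # some value in [0, 99]
second = next_sequence(2)  # freshly randomized for the new second
```

With the low bits varying even at one ID per interval, hashing or modding the ID for sharding spreads rows evenly again.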

As I mentioned at the beginning, the ID generator used in my actual project is a variant of the Snowflake algorithm, meaning that I made certain modifications to the Snowflake algorithm. From the content above, you can see that these modifications aim to make the ID generation rules align with the characteristics of my business and address issues like time rollback.

In addition to adopting the Snowflake algorithm, large companies also use other methods, such as Didi and Meituan proposing ID generation based on databases. These approaches are rooted in the companies’ businesses and also solve the problem of global uniqueness of IDs in distributed environments. For you, it is beneficial to understand different methods from multiple perspectives to find a solution that best suits your business’s current needs. However, I want to emphasize that the focus should be on quality rather than quantity when it comes to solutions. There is no “best” solution, only the most suitable one. Fully understanding the underlying principles of a method and applying it effectively is your best choice.

Course Summary #

In this lesson, I used my project experience to explain how to use the Snowflake algorithm to solve the problem of globally unique database IDs after sharding. In this problem, I also extended the discussion to the requirements of generating IDs, such as monotonicity and meaningfulness. Of course, the focus of our discussion was how to implement the Snowflake algorithm and how to solve the challenges we encountered during the implementation process.

The Snowflake algorithm is not complicated. When adopting it, you don’t need to start with an independent deployment. Instead, first think about how many bits each part of the ID should occupy based on your own business scenario: how many IDCs will your business deploy, how many application servers, what throughput of IDs per second is required, and so on. Then implement a simple version of the algorithm in your business code. Once the number of application servers reaches a certain scale, you can consider deploying it as a standalone service. This approach avoids maintaining a separate ID generator service early on and reduces operational complexity.