
39 Discuss the common distributed ID design solutions; is Snowflake affected by daylight saving time changes? #

Most of the topics in this column focus on the Java language and the virtual machine, mostly in a single-machine context. Today I will add a question related to distributed systems. Strictly speaking, distributed computing is not a Java-specific domain but a major topic in its own right; it is, however, something that may well come up in Java technical interviews. When preparing, hands-on experience with distributed systems is ideal; if you lack it, you can still prepare the typical problems and foundational techniques. My own practical experience with distributed computing is quite limited, so in this column I will approach it mostly from theory.

The question I want to discuss with you today is: what are the common design solutions for distributed IDs, and is Snowflake affected by daylight saving time changes?

Typical Answer #

First, we need to clarify the typical definition of distributed IDs, which includes:

  • Globally unique: the ID must be unique across the entire distributed system, not merely within a single node.
  • Sequential: the generated IDs usually need to be increasing in order. In scenarios such as database storage, for example, sequential IDs make it easier to locate data and are often more efficient.

Currently, there are many solutions in the industry, and typical ones include:

  • Implementation based on database auto-increment sequences. This approach has clear advantages and disadvantages. The advantage is that it is simple and easy to use, but it has limitations in terms of scalability and reliability.
  • Implementation based on Twitter’s early open-source solution Snowflake, as well as its various derivatives. This is currently a widely used approach; you can refer to the diagram below for its structure definition.

Figure: Snowflake ID bit layout (1-bit sign + 41-bit timestamp + 10-bit worker ID + 12-bit sequence)

The overall length is usually 64 bits (1 + 41 + 10 + 12 = 64), suitable for storage using the long data type in Java.

The first bit is the sign bit, which is kept at 0 so that the ID remains a positive number.

Next is the high-order part containing a 41-bit timestamp, usually obtained from System.currentTimeMillis().

After that, there are 10 bits for the WorkerID. The standard definition is 5 bits for the data center and 5 bits for the machine ID, forming a machine number to distinguish different cluster nodes.

The last 12 bits are the sequence number, which caps the number of IDs a single worker can generate within one millisecond at 4096.

The official Snowflake implementation was written in Scala, and there are many reference implementations in other languages such as Java. It is a very simple and practical approach, and the allocation of bits can be adjusted to the actual needs of your distributed system rather than strictly following the layout in the diagram.
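To make the bit layout concrete, below is a minimal, single-node Java sketch of this packing. It assumes the standard 1 + 41 + 10 + 12 split and uses Twitter’s original epoch constant for the timestamp offset; worker-ID assignment and more robust clock handling are deliberately left out.

```java
/**
 * Minimal sketch of a Snowflake-style generator, only to illustrate the bit layout.
 * The custom epoch and the simple clock handling are simplifying assumptions,
 * not Twitter's original implementation.
 */
public class SnowflakeSketch {
    private static final long CUSTOM_EPOCH = 1288834974657L;            // Twitter's original epoch
    private static final long WORKER_ID_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_WORKER_ID = ~(-1L << WORKER_ID_BITS); // 1023
    private static final long SEQUENCE_MASK = ~(-1L << SEQUENCE_BITS);  // 4095

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER_ID) {
            throw new IllegalArgumentException("workerId must be in [0, " + MAX_WORKER_ID + "]");
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis();
        if (timestamp < lastTimestamp) {
            // Clock moved backwards; a real implementation must handle this case (see the clock-skew discussion below).
            throw new IllegalStateException("Clock moved backwards, refusing to generate ID");
        }
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // 4096 IDs already issued in this millisecond: spin until the next one.
                while (timestamp <= lastTimestamp) {
                    timestamp = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestamp;
        // Pack: 41-bit timestamp delta | 10-bit worker ID | 12-bit sequence; the sign bit stays 0.
        return ((timestamp - CUSTOM_EPOCH) << (WORKER_ID_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```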

Redis, ZooKeeper, MongoDB, and other middleware also offer various unique ID solutions, some of which can be seen as variations on the Snowflake idea. For example, MongoDB’s ObjectId defines a 12-byte (96-bit) ID consisting of a 32-bit timestamp in seconds, a 24-bit machine identifier, a 16-bit process ID, and a 24-bit counter that starts from a random value.
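As a small illustration of how such time-prefixed IDs can be unpacked, the sketch below extracts the creation time from a 24-character hex ObjectId, relying only on the layout described above (the first 4 bytes store seconds since the Unix epoch). The sample value in main is made up purely for demonstration.

```java
import java.time.Instant;

public class ObjectIdTimestamp {

    // Parses the leading 8 hex characters (4 bytes) of an ObjectId as seconds since the Unix epoch.
    public static Instant creationTime(String objectIdHex) {
        if (objectIdHex == null || objectIdHex.length() != 24) {
            throw new IllegalArgumentException("ObjectId must be 24 hex characters");
        }
        long seconds = Long.parseLong(objectIdHex.substring(0, 8), 16);
        return Instant.ofEpochSecond(seconds);
    }

    public static void main(String[] args) {
        // Hypothetical ObjectId value, used only to demonstrate the decoding.
        System.out.println(creationTime("5f2b8c3e9d1e8a0012345678"));
    }
}
```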

Some major Chinese companies have open-sourced their own distributed ID implementations. InfoQ has covered WeChat’s seqsvr, which adopts a relatively complex two-layer architecture and a design tailored to the characteristics of social applications; refer to the related implementation for details. Baidu, Meituan, and others have also open-sourced or shared their own distributed ID implementations, which are likewise worth referencing.

As for the second question, “Is Snowflake affected by daylight saving time (DST) changes?”

I don’t think it is affected, and you can confirm this by looking at a concrete Snowflake implementation. Most Java implementations of the algorithm rely on System.currentTimeMillis(). What does this value represent? According to the Javadoc, it returns the number of milliseconds elapsed since midnight, January 1, 1970 UTC (the Unix epoch). This count is independent of time zones and DST rules, so switching into or out of DST does not affect it.
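A quick way to convince yourself of this is to change the JVM’s default time zone and observe that System.currentTimeMillis() is unaffected; the two zones below are arbitrary examples, one that observes DST and one that does not.

```java
import java.util.TimeZone;

public class EpochMillisDemo {
    public static void main(String[] args) {
        long before = System.currentTimeMillis();

        // Switch the default zone to one that observes DST...
        TimeZone.setDefault(TimeZone.getTimeZone("America/New_York"));
        long inDstZone = System.currentTimeMillis();

        // ...and then to one that does not.
        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"));
        long inNonDstZone = System.currentTimeMillis();

        // Apart from the few milliseconds elapsed between calls, the values are identical:
        // the epoch-millisecond count does not depend on the default time zone or its DST rules.
        System.out.println(inDstZone - before);
        System.out.println(inNonDstZone - inDstZone);
    }
}
```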

Analysis of the main points #

Today’s question is not only a popular interview topic but also widely applicable in practice. The answer I gave above is only a brief introduction to the typical solutions; I suggest analyzing specific solutions in depth so that you are prepared for follow-up questions. And if a system you work on already uses distributed IDs, understanding its design details is essential.

When it comes to distributed systems, many simple problems in a single-machine mode suddenly become complex. This is the inherent complexity of distributed systems, and it requires us to understand the application scenarios, architecture, and detailed algorithms from different perspectives. I will provide appropriate interpretations from the following perspectives:

  • What kind of distributed ID does our business actually need? Besides being unique and ordered, what other factors must be considered?
  • In practical scenarios, what are the possible limitations or problems for typical solutions, and what measures can be taken to solve them?

Knowledge Expansion #

If we attempt to delve into this question, we first need to clarify the essential requirements of the business scenario. What kind of distributed ID do we actually need?

In addition to being unique and ordered, considering the functional requirements of distributed systems, we usually also hope that distributed IDs can guarantee the following:

  • Meaningful, or carrying more information, such as time or business-related data. This is somewhat related to the ordering requirement: if the ID embeds a timestamp, it inherently provides a degree of ordering, although not an absolute guarantee. Embedding extra information in the ID can also help optimize data access in distributed storage scenarios.
  • Highly available, which is an unavoidable requirement of distributed systems. Among the solutions mentioned earlier, some are truly distributed while others follow a traditional master-slave approach. There is no absolute right or wrong here; it depends on what the business requires in terms of scalability, performance, and other aspects.
  • Compact: the size of the ID may be constrained in practice. For example, overly long IDs are not friendly to database storage and can reduce the performance of databases such as MySQL, and programming languages may also be limited by data type lengths when handling them.

In real production environments there are often concrete throughput requirements as well, such as QPS (queries per second); at the scale of China’s first-tier internet companies in particular, the ID service has to be sized for peak business traffic.

Secondly, let’s analyze the pros and cons of the mainstream solutions.

The database auto-increment solution, besides being simple to implement, also guarantees that the generated IDs increase by a fixed step, which is very convenient to use.

However, every ID retrieval triggers a database write, which is a costly operation. Building a highly scalable, high-performance solution on top of it is complex, its performance ceiling is obvious, and scaling it out is difficult. Ensuring high availability is also challenging: the database can go down, and even with measures such as master-slave hot standby, duplicate IDs may still be issued during failover.
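One commonly described shape of this approach, sometimes called a ticket server, is sketched below: every nextId() call advances a single-row auto-increment table and reads the value back. The table name ticket and its schema (an AUTO_INCREMENT id plus a unique stub column) are assumptions for illustration, and the SQL assumes MySQL; the sketch mainly shows why each ID costs a database round trip.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TicketServerIdGenerator {
    private final Connection connection;

    public TicketServerIdGenerator(Connection connection) {
        this.connection = connection;
    }

    public long nextId() throws SQLException {
        try (Statement stmt = connection.createStatement()) {
            // REPLACE keeps the table at a single row while still advancing the AUTO_INCREMENT counter.
            stmt.executeUpdate("REPLACE INTO ticket (stub) VALUES ('a')");
            // LAST_INSERT_ID() is scoped to this connection, so concurrent clients do not interfere.
            try (ResultSet rs = stmt.executeQuery("SELECT LAST_INSERT_ID()")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}
```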

In practice, major companies therefore often build multi-layered composite architectures on top of the database. For example, Meituan’s open database-based solution, Leaf-Segment, introduces the Leaf service as a caching layer and performs database operations in batches through database middleware. This guarantees performance and scalability as well as high availability, but it places considerable demands on the infrastructure and may be more than an ordinary-scale business needs.
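To illustrate the segment idea in isolation, here is a much-simplified sketch rather than Meituan’s actual Leaf code: the rangeAllocator hook stands in for the database middleware (for example, an atomic “advance the counter by step and return the new maximum” operation), and details such as double buffering and per-business-tag segments are omitted.

```java
import java.util.function.LongUnaryOperator;

public class SegmentIdAllocator {
    private final LongUnaryOperator rangeAllocator; // given a step, atomically advances the DB counter and returns the new maximum
    private final long step;

    private long current; // next ID to hand out
    private long max;     // exclusive upper bound of the in-memory segment

    public SegmentIdAllocator(LongUnaryOperator rangeAllocator, long step) {
        this.rangeAllocator = rangeAllocator;
        this.step = step;
    }

    public synchronized long nextId() {
        if (current >= max) {
            // Segment exhausted: one database round trip buys the next `step` IDs.
            max = rangeAllocator.applyAsLong(step);
            current = max - step;
        }
        return current++;
    }
}
```

With a step of, say, 1000, the database is touched once per thousand IDs instead of once per ID, which is where the performance and scalability gains come from.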

In comparison, the advantages of the Snowflake solution are its simple algorithm, minimal dependencies, predictable sequence generation, and excellent performance. For example, Twitter’s peak exceeds 100,000 IDs per second.

However, it also has limitations, most notably clock skew. Clocks in ordinary computer systems cannot guarantee long-term accuracy and may even move backwards, producing incorrect timestamps and, in turn, duplicate IDs. Because Snowflake depends on the clock, Twitter’s documentation once suggested running NTP; there have also been suggestions to disable NTP entirely. Personally, I believe NTP should stay enabled, but configured so that it never steps the clock backwards, which prevents sudden time shifts.

From a design and coding perspective, an effective safeguard is to cache the most recently used timestamp and validate against it before generating the next sequence. If the current time turns out to be earlier than that cached timestamp, the generator can react appropriately: retry, wait for the clock to catch up, or simply report the service as unavailable.
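One possible shape of such a safeguard is sketched below; it is not a standard API, and the 5 ms tolerance and the class and method names are illustrative assumptions. A small rollback is waited out, while a larger one is surfaced as an error so the caller can mark the service unavailable.

```java
public final class ClockRollbackGuard {

    private static final long MAX_BACKWARD_MS = 5L; // assumed tolerance for a small rollback

    private ClockRollbackGuard() { }

    // Returns a wall-clock timestamp that is not earlier than lastTimestamp,
    // so the caller never risks issuing duplicate IDs.
    public static long currentTimeNotBefore(long lastTimestamp) throws InterruptedException {
        long now = System.currentTimeMillis();
        if (now >= lastTimestamp) {
            return now;
        }
        long behind = lastTimestamp - now;
        if (behind <= MAX_BACKWARD_MS) {
            Thread.sleep(behind);                  // small skew: wait for the clock to catch up
            now = System.currentTimeMillis();
            if (now >= lastTimestamp) {
                return now;
            }
        }
        // Large or persistent rollback: refuse to generate IDs rather than risk duplicates.
        throw new IllegalStateException("Clock moved backwards by about " + behind + " ms");
    }
}
```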

  • In addition, the predictability of the sequence is a double-edged sword. Although it simplifies some engineering problems, it is unsuitable for many business scenarios; in particular, using such an ID as a security token is dangerous, because it can easily be guessed and exploited by attackers.
  • Carefully consider what information the ID design exposes. For example, the Erlang implementation flake derives the worker ID from the MAC address, which is often unacceptable in security-sensitive domains.
  • Schemes like Snowflake also have a theoretical limit analogous to the year-2038 problem, because the timestamp occupies a fixed number of bits; 41 bits of milliseconds cover roughly 69 years from the chosen epoch. Although it is far too early for current designs to worry about problems decades away, it is worth understanding these limits, as they can come up in interviews.

If you dig deeper into clocks and event ordering in distributed systems, you will find problems that are related to, yet distinct from, distributed IDs. For example, when the clocks on different machines are almost certainly inconsistent, how do we establish the order of events? Lamport’s 1978 paper, Time, Clocks, and the Ordering of Events in a Distributed System, analyzes this in depth; interested readers can look for translations and commentaries.
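For a flavor of the paper’s core idea, here is a tiny sketch of a Lamport logical clock in Java: instead of trusting physical time, each node keeps a counter that is incremented on local events and merged on message receipt, which yields a consistent happened-before ordering.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong time = new AtomicLong(0);

    // Called for a local event or just before sending a message; returns the timestamp to attach.
    public long tick() {
        return time.incrementAndGet();
    }

    // Called when a message carrying a remote timestamp is received.
    public long onReceive(long remoteTimestamp) {
        return time.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }

    public long current() {
        return time.get();
    }
}
```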

Finally, I would like to add some hot topics in the current distributed field that are often discussed in interviews, such as:

  • Distributed transactions, including their causes, business background, and mainstream solutions.
  • Understanding theories such as CAP and BASE, thinking about problems from the perspective of eventual consistency, and understanding consensus algorithms such as Paxos and Raft.
  • Understanding typical implementations of distributed locks, such as the most common Redis distributed lock.
  • Typical algorithms in the distributed field, such as load balancing. At least, understanding the principles of the main solutions.

These aspects have been relatively thoroughly analyzed, especially based on the practical experience of first-tier companies. In addition, the “Programmer’s Leveling Guide” column by the Left Ear Wind provides comprehensive learning materials on distributed systems. Interested students can refer to it.

Today, I briefly reviewed the current typical distributed ID generation schemes and discussed some considerations in ID design, especially the shortcomings of the widely used Snowflake. I hope this helps you.

Practice Lesson #

Do you have a clear understanding of the topic we discussed today? Today’s reflection question: because Snowflake is a time-based algorithm, its format inherently caps how many IDs can be generated concurrently (at most 4096 per worker per millisecond). If, under extreme circumstances, more IDs are needed in a short period, what can be done to resolve this?

Please write your thoughts on this question in the comments section. I will select thoughtful comments and give you a learning reward voucher. Welcome to discuss with me.

Are your friends also preparing for interviews? You can “ask a friend to read” and share today’s question with them. Perhaps you can help them.