How to design a distributed ID generator?

How to design a distributed ID generator?

Hello everyone, I am Brother Shu.

In complex distributed systems, it is often necessary to uniquely identify large amounts of data and messages, such as the ID primary key of a sharded database or table, the request ID of distributed tracing, and so on.

Therefore, designing a "distributed ID issuer" has become a very common system design problem. Today I will show you how to design a distributed ID issuer.

Article mind map

System Requirements

For business systems, there are generally the following requirements for globally unique IDs:

  • Globally unique. The generated ID cannot be repeated. This is the most basic requirement. Otherwise, primary key conflicts will occur in the scenario of sharding.
  • Monotonically increasing. Ensure that the next ID is greater than the previous ID, so that the data is written sequentially when it is written to the database, improving the writing performance.

For the above two requirements, the first one is required by all systems. The second one is not required by all systems. For example, the request ID of distributed tracing does not need to be monotonically increasing. For those scenarios that need to be stored in the database as an ID, it may be necessary to ensure that the globally unique ID is monotonically increasing.

In addition, we may also need to consider security issues. If a globally unique ID is incremented sequentially, it may cause business information leakage. For example, if the order ID is incremented by 1 each time, then competitors can directly know the number of orders we place every day through the order ID, which is unacceptable for business.

There are many unique ID solutions on the market to meet the above demands. The most common solutions are as follows:

  • UUID
  • Snowflake Algorithm
  • Database auto-increment primary key
  • Redis atomic increment

UUID

UUID stands for Universally Unique Identifier, which is a native API in Java. A standard UUID contains 32 hexadecimal digits, divided into 5 segments with a hyphen as a separator. The length of each segment is 8 characters, 4 characters, 4 characters, 4 characters, and 12 characters, respectively, with a total of 36 characters, as shown in the figure below. A simple UUID example: 630e4100-e29b-33d4-a635-246652140000.

UUID composition diagram

The advantage of UUID, a unique ID solution, is that it has no external dependencies and is purely locally generated, so its performance is very high. However, its disadvantages are also very obvious:

The field is very long, wasting storage space. UUID is generally a 36-character string. If it is stored as a database primary key, it will greatly increase the storage space of the index.

Non-auto-incremental, which reduces database write performance. UUID is not auto-incremental. If it is used as a database primary key, sequential writes cannot be implemented, which will reduce database write performance.

No business meaning. UUID has no business meaning, and we cannot derive any meaning from it.

Therefore, UUID is more suitable for non-database ID storage, such as generating a local distributed tracing request ID.

Snowflake Algorithm

SnowFlake is an open-source distributed ID generation algorithm developed by Twitter. The idea is to use 64 bits to represent an ID and divide the 64 bits into 4 parts, as shown in the following figure.

Snowflake algorithm unique ID composition diagram

  • The first part: 1 bit. Fixed to 0, indicating a positive integer. The highest bit in binary is the sign bit, 1 indicates a negative number, and 0 indicates a positive number. IDs are all positive integers, so they are fixed to 0.
  • The second part: 41 bits. Indicates the timestamp, accurate to milliseconds, and can be used for 69 years. The timestamp has an auto-increment property.
  • The third part: 10 bits. It represents the 10-bit machine ID, supporting up to 1024 nodes. This part can also be divided into 5-bit datacenterId and 5-bit workerId. DatacenterId represents the computer room ID, and workerId represents the machine ID.
  • The fourth part: 12 bits. Indicates serialization, that is, a series of self-incrementing IDs, which can support the same node to generate up to 4095 ID numbers in the same millisecond.

The advantages of the snowflake algorithm are:

  • It has business meaning and is customizable. Each bit of the ID of the snowflake algorithm has a special meaning, and we can infer the corresponding meaning from the different digits of the ID. In addition, we can also add or delete the digits of each part according to our own needs, so as to realize a customized snowflake algorithm.
  • The ID increases monotonically, which is beneficial to improving write performance. The last part of the ID of the snowflake algorithm is an increasing sequence number, so the ID it generates is increasing. When it is used as the database primary key ID, sequential writing can be achieved, thereby improving write performance.
  • No reliance on third-party systems. The generation method of the snowflake algorithm does not rely on third-party systems or middleware, so it is more stable.
  • The security problem is solved. The IDs generated by the snowflake algorithm are monotonically increasing, but the increment step is not fixed, so the number of generated IDs cannot be inferred from the difference in IDs, thus protecting business privacy.

The snowflake algorithm is almost perfect, but it has a fatal flaw: it is strongly dependent on the machine time. If the system time on the machine is dialed back, that is, the time is slower than the normal time, then duplicate numbers may appear.

In this case, we can maintain a file locally, write the last timestamp, and then compare it with the current timestamp. If the current timestamp is less than the last timestamp, it means that there is a problem with the system time and it should be handled in time.

In general, the snowflake algorithm is not only shorter in length but also has business significance. It can also improve write performance in database storage scenarios. Therefore, the snowflake algorithm for generating distributed unique IDs is popular among people.

At present, many domestic large companies have implemented open source number generators based on the snowflake algorithm, such as Baidu's open source UidGenerator, Meituan's open source Leaf, etc. The core of these snowflake-like algorithms is to divide 64 bits more reasonably, so that they are more suitable for their own scenarios.

Database auto-increment primary key

Speaking of unique ID, we naturally think of the auto-increment primary key of the database, because it is unique.

In the case of low concurrency, we can directly deploy one machine, insert a piece of data into the database table each time an ID is obtained, and then return the primary key ID.

The advantage of this method is that it is very simple and has low implementation cost. In addition, the generated unique ID is also monotonically increasing, which can meet the requirements of database write performance.

But its disadvantage is also very obvious, that is, it is highly dependent on the database. When the database is abnormal, the entire system will be unavailable. Even if a high-availability switch is made, if the data synchronization is inconsistent during the master-slave switch, it may still cause duplicate numbers.

In addition, since it is a stand-alone deployment, its performance bottleneck is limited to the read and write performance of a single MySQL machine, and it is destined to be unable to bear high-concurrency business scenarios.

We can solve the performance problem mentioned above through cluster deployment. We can solve the ID conflict problem after cluster deployment by setting the increment step. For example, if we have 3 machines, we set the increment step to 3, and the ID generation strategy for each machine is:

  • The first machine starts at 0 and increments in steps of 3. The generated IDs are: 0, 3, 6, 9, and so on.
  • The second machine starts from 1 and increments in steps of 3. The generated IDs are: 1, 4, 7, 10, and so on.
  • The third machine starts from 2 and increments in steps of 3. The generated IDs are: 2, 5, 8, 11, and so on.

This method solves the problems of cluster deployment and ID conflicts, and can improve the capacity of concurrent access to a certain extent. However, its disadvantages are also obvious:

We can only rely on stacking machines to improve performance. When the number of requests increases again, we can only stack machines infinitely, which seems to be a physical defense.

Horizontal expansion is difficult. When we need to add a machine, the process is very troublesome. First, we need to deploy the new server and set a new step size. The starting value must be set to an impossible value. After the new server is deployed, we need to deal with the old servers one by one. This process is really painful and can be said to be manual operation and maintenance.

Redis atomic increment

Since Redis is an in-memory database, its powerful performance is very suitable for implementing high-concurrency distributed ID generation. To implement auto-increment ID based on Redis, the main method is to use the INCR command in Redis. This command can increment a number and return the result at the same time, and this operation is an atomic operation.

The distributed ID function is implemented through Redis. Its mode is similar to the self-increment ID through the database, except that the storage medium is changed from hard disk to memory. When a single Redis cannot support concurrent requests, Redis can also be solved by cluster deployment and step size setting.

However, the same problems that database auto-increment primary keys have, Redis auto-increment ID also has, that is, it can only stack machines and has difficulty in horizontal expansion. In addition, compared with the persistence of database storage, Redis is based on memory storage, and the persistence problem needs to be considered, which is also a headache.

Summarize

After looking at so many distributed ID solutions, which one should we choose?

When we make decisions, we should determine the dimensions of the decision. For this issue, the dimensions we should focus on are: R&D cost, concurrency, performance, and operation and maintenance cost.

First of all, UUID is actually inferior to the snowflake algorithm in all aspects. Its only advantage is that JDK comes with its own API. Therefore, if you just want to use it very simply, you can just use UUID directly. After all, you still need to write some implementation code for the snowflake algorithm.

Secondly, the snowflake algorithm is undoubtedly a very good implementation. Compared with UUID, it not only has the advantages of UUID being generated locally and not relying on third-party systems, but also has business significance, can improve writing performance, and solve security issues. However, its disadvantage is that the code of the snowflake algorithm must be implemented, so its R&D cost is slightly higher than that of UUID.

Finally, for the two methods of database self-increment ID and Redis atomic self-increment. The advantage of the database self-increment ID method is also simple and convenient, and does not require too high R&D costs. But its disadvantage is that the supported concurrency is too low, and the subsequent operation and maintenance costs are too high. Therefore, the database self-increment ID method should be suitable for small-scale usage scenarios. The Redis atomic self-increment method is preferred to support high-concurrency scenarios. But the disadvantage is that you need to handle the persistence problem yourself, and the operation and maintenance costs may be relatively high.

I prefer the database auto-increment method. The two methods are very similar, the only difference is the storage medium. Redis atomic auto-increment method is very fast, and a single machine may be several times faster than the database method. However, if you want to consider the issue of persistence, it is too complicated for Redis.

We sort out the above four implementation methods and summarize them into the following comparison table:

Implementation

advantage

shortcoming

Usage scenarios

UUID

High performance, no reliance on third parties, low R&D costs

Long fields waste storage space, IDs do not increment automatically, writing performance is poor, and there is no business significance

Very simple usage scenario (for simple testing)

Snowflake Algorithm

Has business significance, good monotonically increasing write performance, no reliance on third parties, and business security

Strong dependence on machine time

High concurrency, complex business scenarios, and the need to expose IDs to external systems

Database auto-increment ID

Low R&D cost, good monotonically increasing write performance

Depends on database, can only stack machines to improve performance, high maintenance cost

Requires persistence and is not exposed to external systems

Redis atomic increment

High concurrency, monotonically increasing write performance

Depends on Redis, has business problems, and has persistence problems

No persistence required, not exposed to external systems

In general, if you consider long-term use, then operation and maintenance costs and high concurrency must be considered. Under this basic condition, the snowflake algorithm and the database auto-increment ID may be relatively good choices.

References

  • Some thoughts on the global number generator of distributed systems - Nuggets
  • VIP! ! Very good! Leaf—— Meituan Dianping Distributed ID Generation System - Meituan Technology Team
  • (8 messages) Six ways to implement a globally unique number generator_Beihe M's blog - CSDN blog_Number generator
  • VIP! ! Good, expand your horizons! 10 | Number generator: How to ensure the global uniqueness of IDs after sharding? ​

<<:  If the TCP protocol is used, will there be no packet loss?

>>:  Easy-to-understand illustrations of online interview knowledge - Part 1

Recommend

Four Best Practices for Network Cable Management

If a cabling project is to be successful, you fir...

Eight networking trends your business should know about

[[386593]] The coronavirus outbreak broke out in ...

Understand fiber-based LAN architecture

A local area network (LAN) is a computer network ...

It’s time to consider leaf-spine network architecture

With the changes in traffic flows used in modern ...

Learn the history of HTTP in 6 minutes

[[386748]] HTTP/0.9HTTP/0.9 was proposed in 1991 ...

Network Basics: TCP/IP protocol responsibilities and three common models

1. The main responsibilities of TCP/IP protocol ●...