Hello everyone, I am captain. If you find this article helpful, please follow along! In the previous article we discussed how to keep messages from being lost, looking at the problem from the three roles of producer, broker, and consumer; in truth, a 100% guarantee is impossible, though only in extreme cases. The topic of this article is how to handle message accumulation. It is a big topic, because a message backlog is a genuine headache: when the accumulated volume is large it is a very frustrating problem, but it is also exactly the kind of situation that tests whether you can handle problems calmly. Let's analyze the relevant issues together!
Generally, the best way to resolve this kind of urgent problem is to expand capacity temporarily and consume the data at a much faster rate:
1. Temporarily create a new Topic and set its queue count to 10 or 20 times the original, depending on how severe the backlog is.
2. Write a temporary distribution (forwarding) consumer and deploy it to consume the backlogged messages. It performs no time-consuming business logic; it simply polls the backlogged messages and writes them evenly into the queues of the newly created Topic (a sketch of such a forwarder is shown below).
3. Add a corresponding multiple of machines to deploy the real consumers, point them at the temporary Topic, and let them actually consume the messages in those temporary queues.
The principle is simple. To give a vivid example: if a topic is blocked, create a new topic to divert the traffic, temporarily expand the queue resources and consumer resources by 10 times, distribute the messages evenly across the newly added queues and consumers, consume at 10 times the normal speed, and once the backlog has been drained, restore the original deployment architecture. This is only a temporary fix for a backlog caused by some abnormal situation; if messages are frequently backed up, it is time to consider strengthening the system's deployment architecture for good.
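As a rough illustration, here is a minimal sketch of such a forwarding consumer using the RocketMQ Java client. The topic names (ORIGINAL_TOPIC, TEMP_WIDE_TOPIC), group names, and name server address are all placeholders for this example, not values from the original article:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class BacklogForwarder {

    public static void main(String[] args) throws Exception {
        // Producer that writes into the temporary, wider topic (10-20x the queues).
        DefaultMQProducer producer = new DefaultMQProducer("backlog_forward_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();

        // Consumer that drains the backlogged original topic.
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("backlog_forward_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("ORIGINAL_TOPIC", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            for (MessageExt msg : msgs) {
                try {
                    // No business logic here: just re-publish the raw body to the temporary topic.
                    producer.send(new Message("TEMP_WIDE_TOPIC", msg.getTags(), msg.getBody()));
                } catch (Exception e) {
                    // If forwarding fails, ask the broker to redeliver this batch later.
                    return ConsumeConcurrentlyStatus.RECONSUME_LATER;
                }
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```

The real consumers are then deployed against TEMP_WIDE_TOPIC exactly as the normal consumers were, just with 10x the instances.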
In RabbitMQ, you can set an expiration time (TTL) on messages, similar to key expiration in Redis. If a message sits in the queue longer than the TTL, RabbitMQ clears it and the data is gone, which can mean a large amount of data loss. In that case the scaling solution above no longer applies; instead you can use a batch re-delivery solution: when system traffic is relatively low, use a program to query the lost data and re-send those messages to MQ to recover them. This can be thought of as a compensation task, the kind usually used to make up for scheduled batch runs.
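A minimal sketch of such a compensation job with the RabbitMQ Java client might look like the following. The LostOrderDao interface, the order_queue name, and the host are hypothetical placeholders; the point is only to show the "query what was lost, then re-publish" loop:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class LostMessageCompensator {

    // Hypothetical DAO: finds business records whose messages expired in the
    // queue before any consumer processed them.
    interface LostOrderDao {
        List<String> findUnprocessedOrderIds();
    }

    public static void republish(LostOrderDao dao) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("127.0.0.1");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            for (String orderId : dao.findUnprocessedOrderIds()) {
                // Re-send the message so the normal consumer can pick it up again.
                channel.basicPublish("", "order_queue", null,
                        orderId.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```

In practice this job would be scheduled for an off-peak window, since it adds extra load to both the database and the broker.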
Ultimately, message accumulation is caused by a mismatch between the speed at which producers produce messages and the speed at which consumers consume them. Perhaps a sudden promotion caused a surge in business volume, so producers sent far more messages and consumption could not keep up. It may also be that a consumer keeps failing and retrying frantically, or that the consumers' consumption capacity is simply too low. RocketMQ balances load at the queue level: if one consumer machine cannot process the queues assigned to it in time, for example because of hardware problems, the messages in those queues will back up. RocketMQ has load-balancing strategies on both the publishing and subscribing sides, and the default is even distribution: producers send messages to the message queues in a round-robin manner, and the broker distributes those queues evenly among the consumer instances belonging to the same consumer group.
If one of the machines slows down, the messages assigned to the queues on that machine may not be processed in time, whether because of the machine's hardware, the operating system, remote RPC calls, or Java GC pauses. Because RocketMQ maintains consumption load at queue granularity, all the messages on those queues will accumulate.
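For reference, the default "divide the queues evenly among the group's instances" behavior corresponds to the AllocateMessageQueueAveragely strategy in the Java client; a minimal sketch (topic, group, and address are placeholders) that sets it explicitly:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.consumer.rebalance.AllocateMessageQueueAveragely;

public class AllocationConfig {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("order_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        // Default strategy: queues are divided evenly among the instances of this group,
        // and each queue is owned by exactly one consumer instance at a time.
        consumer.setAllocateMessageQueueStrategy(new AllocateMessageQueueAveragely());
        consumer.subscribe("ORDER_TOPIC", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) ->
                ConsumeConcurrentlyStatus.CONSUME_SUCCESS);
        consumer.start();
    }
}
```

Because each queue belongs to exactly one instance, one slow instance stalls exactly the queues it owns, which is why the backlog shows up per queue rather than spread across the group.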
We know the most fundamental cause is a mismatch between production and consumption speed. If the problem occurs frequently, it is an architectural issue, and you need to consider adding consumers. If it is caused by a temporary situation such as a promotion, the system should be able to digest the backlog quickly and it will not last long; but if the promotion runs for a long time and the high traffic is sustained, there is no way around it and you still have to add machines. When this kind of backlog occurs, first locate why consumption is slow: it may be a code bug causing repeated retries, in which case fix the bug and optimize the consumption logic. Then consider horizontal scaling, increasing both the number of Topic queues and the number of consumers. When increasing these two, keep them in balance: the number of queues must be increased as well, otherwise the newly added consumers end up in the embarrassing situation of having no messages to consume. A queue in a topic is assigned to only one consumer in a group, so when the number of consumers exceeds the number of queues, the extra consumers have nothing to consume.
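The topic's queue count is typically raised on the admin side (for example with the mqadmin updateTopic command, adjusting the read/write queue numbers), while each consumer instance can also be given more consuming threads. A rough sketch of the consumer side, with placeholder topic, group, and address, and assumed thread counts:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;

public class ScaledConsumer {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("order_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        // More threads per instance helps only if each queue still has work to hand out;
        // adding instances beyond the queue count leaves the extra instances idle.
        consumer.setConsumeThreadMin(20);
        consumer.setConsumeThreadMax(40);
        consumer.subscribe("ORDER_TOPIC", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) ->
                ConsumeConcurrentlyStatus.CONSUME_SUCCESS);
        consumer.start();
    }
}
```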
Messages are always stored in the CommitLog file. The CommitLog lives on disk, and RocketMQ is designed for sequential writes only, which means all messages are appended to this file in order. Since each message has a variable size, deleting individual messages is practically impossible and the logic would be extremely complicated, so a message is not cleared as soon as it is consumed; it remains in the CommitLog file. The question, then, is how RocketMQ knows which messages have been consumed and which have not, if consumed messages are not deleted. The answer is that the client maintains a consumption offset: after the client pulls messages, the broker returns the next pull position along with the response, and the consumer updates its own next pull position accordingly.
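This "broker hands back the next position, client keeps the progress" mechanic is easiest to see with the pull-style API. A minimal sketch using the older DefaultMQPullConsumer, with placeholder topic and group names, pulling the first queue from offset 0:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPullConsumer;
import org.apache.rocketmq.client.consumer.PullResult;
import org.apache.rocketmq.common.message.MessageQueue;

public class OffsetDemo {
    public static void main(String[] args) throws Exception {
        DefaultMQPullConsumer consumer = new DefaultMQPullConsumer("offset_demo_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.start();

        long offset = 0L; // demo: start from the beginning of the queue
        for (MessageQueue mq : consumer.fetchSubscribeMessageQueues("ORDER_TOPIC")) {
            // Pull up to 32 messages starting from the current offset of this queue.
            PullResult result = consumer.pull(mq, "*", offset, 32);
            // The broker tells the client where to continue from; the client keeps this
            // value as its consumption progress, rather than the broker deleting messages.
            offset = result.getNextBeginOffset();
            break; // demo: only the first queue
        }
        consumer.shutdown();
    }
}
```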
Messages stored in the files are eventually cleaned up, but this cleanup only happens when one of the following conditions is met, and CommitLog message files are then deleted in batches, whole files at a time:
1. The file has expired (messages are kept for 72 hours by default) and the scheduled deletion time point has been reached (4 AM by default).
2. Disk usage exceeds the configured threshold (75% by default), in which case expired files are deleted immediately.
3. Disk usage keeps climbing and reaches the forcible-cleanup level (85% by default), in which case files are deleted starting from the oldest, whether expired or not.
Note: if disk usage reaches the danger watermark (90% by default), the broker will refuse write requests to protect itself.
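These retention and cleanup thresholds map to broker configuration items. A minimal broker.conf sketch showing the usual defaults (adjust for your own deployment):

```
# Keep CommitLog files for 72 hours before they become eligible for deletion
fileReservedTime=72
# Run the scheduled deletion task at 4 AM
deleteWhen=04
# Start deleting expired files once the disk is 75% full
diskMaxUsedSpaceRatio=75
```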
The default size of a CommitLog file is 1GB, so cleaning one up is a large-file operation that also creates IO pressure. Let me briefly mention a few advantages of this design (there are certainly others):
Only one copy of each message needs to be saved: even if a message must be consumed by multiple consumer groups, only one copy is stored and each group's consumption progress is kept separately, which makes it easier to support massive message storage.
Backtracking is supported: the decision about where to consume from is left to the client, so as long as the message is still there it can be consumed again. RocketMQ therefore supports backtracking consumption, just like dragging the progress bar back and rewatching part of a video.
A message index service is supported: RocketMQ has index files, so as long as a message still exists in the CommitLog it can be looked up, which is convenient for troubleshooting.
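Backtracking in practice is just a matter of telling a consumer group where to start reading. A minimal sketch, with placeholder topic, group, and timestamp, using the push consumer's ConsumeFromWhere setting (it only takes effect for a group that has no saved offset yet):

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.consumer.ConsumeFromWhere;

public class ReplayConsumer {
    public static void main(String[] args) throws Exception {
        // A brand-new consumer group: since it has no committed offset yet,
        // ConsumeFromWhere decides where it starts reading the CommitLog.
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("replay_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.setConsumeFromWhere(ConsumeFromWhere.CONSUME_FROM_TIMESTAMP);
        consumer.setConsumeTimestamp("20240101000000"); // yyyyMMddHHmmss
        consumer.subscribe("ORDER_TOPIC", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) ->
                ConsumeConcurrentlyStatus.CONSUME_SUCCESS);
        consumer.start();
    }
}
```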