Performance Tuning: The RocketMQ timeout exception that has troubled me for half a year has finally been solved

A timeout problem appeared in an intranet environment, and the network team insisted the network was not at fault.

My understanding of timeouts has deepened steadily through hands-on work. The RocketMQ timeout problem in our production environment troubled me for nearly half a year; I have finally reached a fairly good outcome and would like to share it with you.

This write-up touches on several areas I am not deeply familiar with, such as networking. If you spot any errors, please point them out promptly so we can learn and improve together.

1. Network timeout

From time to time I receive feedback from project teams that MQ message sending times out in the production environment. The relevant client log screenshots are as follows:

Today’s story will begin with a picture.

2. Troubleshooting

2.1 Preliminary analysis

There are two very critical logs in the above figure:

  • invokeSync: wait response timeout exception

The network call timed out.

  • receive response, but not matched any request

This log is very telling. It means that although the client timed out while waiting for the server's result, it did eventually receive the server's response; by then, however, the client had already given up waiting and discarded it.

Let me elaborate on the mechanism behind the second log. Multiplexing several requests over one connection is a common programming pattern: a long-lived connection sends two requests to the server in succession, so when the client receives a response, how does it determine which request that response belongs to? As shown in the following figure:

The client has multiple threads and sends two requests, req1 and req2, over the same connection. The server usually processes requests with multiple threads, so the response to req2 may arrive first. How can the client tell which request the returned data corresponds to?

The solution: before sending a request, the client generates a machine-unique request id (requestId), uses the Future pattern to put the requestId and a Future object into a Map, and embeds the requestId in the request body. When the server returns the response, it carries the same requestId back unchanged. On receiving the response, the client first extracts the requestId, looks up the corresponding Future object in the cache, wakes up the business thread, and hands the result back to the caller, completing the round trip. A minimal sketch of this pattern follows.
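The sketch below illustrates this correlation pattern in simplified form; the class and member names (RpcCorrelationSketch, responseTable, sendToChannel) are made up for illustration and are not RocketMQ's actual classes, although RocketMQ's remoting module works along these lines.

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of request/response correlation over one connection.
public class RpcCorrelationSketch {
    private static final AtomicInteger REQUEST_ID = new AtomicInteger(0);
    // requestId -> pending future, shared by all requests on this connection
    private final Map<Integer, CompletableFuture<byte[]>> responseTable = new ConcurrentHashMap<>();

    public byte[] invokeSync(byte[] body, long timeoutMillis) throws Exception {
        int requestId = REQUEST_ID.incrementAndGet();
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        responseTable.put(requestId, future);
        try {
            sendToChannel(requestId, body);               // write requestId + body to the connection
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // analogous to "invokeSync: wait response timeout exception"
            throw e;
        } finally {
            responseTable.remove(requestId);              // give up: a late response will no longer match
        }
    }

    // Called by the IO thread when a response frame arrives.
    public void onResponse(int requestId, byte[] responseBody) {
        CompletableFuture<byte[]> future = responseTable.remove(requestId);
        if (future != null) {
            future.complete(responseBody);                // wake up the waiting business thread
        } else {
            // analogous to "receive response, but not matched any request"
            System.out.println("receive response, but not matched any request, requestId=" + requestId);
        }
    }

    private void sendToChannel(int requestId, byte[] body) {
        // placeholder for the actual network write
    }
}
```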

So we can see that the client did not receive the server's response within the specified time but did receive it eventually. Suspicion therefore fell squarely on the Broker side: does the Broker have a bottleneck that makes processing very slow?

2.2 Broker-side Processing Bottleneck Analysis

In my "experience", if there is a bottleneck in RocketMQ message sending, various Broker Busy conditions will usually be returned, and you can determine whether the Broker has encountered a bottleneck by tracking the data indicators written to PageCache on the Broker side.

  1. grep "PAGECACHERT" store.log

The result is similar to the following screenshot:


Tip: the picture above is a screenshot from my local machine. When I analyzed the problem, the time each Broker in the MQ cluster took to write this data to PageCache never exceeded 100 ms.

Precisely because the PageCache write metrics looked so good, I proposed shifting the focus to the network, based on the rough network interaction shown below:

I also confirmed with the business side that although the send call returned a timeout, the message was persisted in MQ and consumers could consume it normally. In theory a LAN should not have such problems, but given the symptoms the network group still started troubleshooting the network.

Spoiler: in the end it turned out that this line of investigation had led us astray.

2.3 Network Analysis

Usually there are two methods for network analysis: netstat and network packet capture.

2.3.1 View Recv-Q and Send-Q with netstat

We can use netstat to focus on two indicators: Recv-Q and Send-Q.

  • Recv-Q

The TCP receive queue: bytes that have been received by the kernel but not yet read by the application.

  • Send-Q

The TCP send queue: bytes that have been sent (or are queued to be sent) but not yet acknowledged by the remote end.

In TCP, the functions of Recv-Q and Send-Q are shown in the following figure:

  • When the client writes data to a network channel (for example an NIO Channel), the data is first copied into the TCP send buffer. If the send buffer is full, the client cannot push more data into the channel: at the NIO level, the Channel's write method returns 0, meaning the send buffer is full and a write event should be registered, so that the upper-layer application is notified once buffer space frees up and can resume sending (see the sketch after this list).
  • Once the data enters the send buffer, it travels over the network to the target host and lands in the target's receive buffer. With NIO, read events keep firing on that channel while data is available, and the application reads the bytes out of the receive buffer; the receiver acknowledges the data back to the sender with an ACK.
  • After the ACK arrives, the sender removes the acknowledged data from its send buffer. If the receiver does not drain its receive buffer, the sender will eventually be unable to send any more data.
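As a concrete illustration of the first point, here is a minimal Java NIO sketch of the non-blocking write pattern; the channel, selector and buffer are assumed to be set up elsewhere, and the method name is made up. This is an illustration of the general pattern, not RocketMQ's code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Minimal sketch of the non-blocking write pattern described above.
public class NonBlockingWriteSketch {

    static void writeOrWaitForSpace(SocketChannel channel, Selector selector, ByteBuffer buffer)
            throws IOException {
        while (buffer.hasRemaining()) {
            int written = channel.write(buffer);   // copies bytes into the TCP send buffer
            if (written == 0) {
                // Send buffer is full: register interest in OP_WRITE and back off.
                // The selector fires a write event once the kernel frees space,
                // and the remaining bytes can be flushed then.
                channel.register(selector, SelectionKey.OP_WRITE, buffer);
                return;
            }
        }
    }
}
```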

Colleagues from the network group collected netstat data on both the client and the MQ server every 500 ms. Summarizing the collected results produced the following figure:

From the client's perspective, the client's Recv-Q showed a large backlog at times, and correspondingly MQ's Send-Q also showed a large backlog.

Based on the communication model above, we then wondered whether the client was reading bytes from the network too slowly. Since the client was a virtual machine, the netstat results pointed to a client-side problem (note: it ultimately turned out not to be a client problem, so don't be misled).

2.3.2 Network packet capture

To be more rigorous, colleagues in the network group also captured the following packets:

There is one very suspicious detail here: the sliding window between the client and the server is only 190 bytes, while the response packet for an MQ send is about 250 bytes, so a response has to be split across two transmissions before it can be fully received. At first I thought this was the main cause, but comparison with other clients showed that timeouts also occurred where the sliding window was not this small, so I judged it was not the main reason. Later, my network colleagues used various tools and concluded that the network was fine and that the problem lay in the application.

After all, this is a local area network, so the next step was to focus on the client's read performance, based on the netstat results.

2.4 Client Network Read Performance Bottleneck Analysis

To verify read/write performance, I modified the relevant RocketMQ Client packages and added Netty performance-collection code. A screenshot of the code is shown below:

The main idea is to measure how many read operations a channel performs each time a read event fires. If one read event requires many reads to drain the socket, it means a lot of data has backed up on that channel and network reading is indeed a bottleneck.
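Since the code screenshot is not reproduced here, the following is only a simplified sketch of one way such a metric could be collected, not the exact patch: a Netty inbound handler placed at the head of the pipeline counts channelRead invocations between two channelReadComplete callbacks, because one read-ready event normally produces several channelRead calls followed by a single channelReadComplete. The class name and logging threshold are illustrative.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Simplified sketch: count how many reads a channel needs per read event.
// Add one instance per channel at the head of the pipeline, so each
// channelRead roughly corresponds to one read() from the socket.
public class ReadCountHandler extends ChannelInboundHandlerAdapter {

    private int readsInCurrentEvent = 0;

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        readsInCurrentEvent++;
        super.channelRead(ctx, msg);
    }

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx) throws Exception {
        // A high count means a lot of data was backed up on this channel.
        if (readsInCurrentEvent > 2) {
            System.out.println(ctx.channel() + " reads in this event: " + readsInCurrentEvent);
        }
        readsInCurrentEvent = 0;
        super.channelReadComplete(ctx);
    }
}
```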

Disappointingly, however, there was no bottleneck in the client's network reading. Some of the collected data is shown below:

Analysis with awk showed that after a read event fires, most channels drain the receive buffer within two reads; there is no bottleneck on the read side. The awk statistics are shown in the following figure:

So the finger pointed back at the Broker again: could it be slow at writing to a cache somewhere?

2.5 Broker-side Network Bottleneck Analysis

According to the analysis above, the Broker writes to the page cache very quickly, but we had no monitoring of the step where response results are written back to the network. Could it be that responses were not written out in time, piling up in Netty's write buffer and failing to reach the TCP send buffer promptly, which caused the timeouts? (One way this could have been observed is sketched below.)
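Had we instrumented this, one low-risk way to observe such a backlog would be Netty's channel-writability callbacks; the handler below is purely an illustration and was not actually deployed, relying on Channel#isWritable and Channel#bytesBeforeWritable from Netty 4.1.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Illustration only: fires when a channel's outbound (write) buffer crosses
// Netty's high water mark, i.e. responses are queuing up faster than they can
// be flushed into the TCP send buffer.
public class OutboundBacklogHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
        if (!ctx.channel().isWritable()) {
            // Bytes that must drain before the channel becomes writable again.
            long backlog = ctx.channel().bytesBeforeWritable();
            System.out.println(ctx.channel() + " outbound backlog, bytesBeforeWritable=" + backlog);
        }
        super.channelWritabilityChanged(ctx);
    }
}
```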

I originally wanted to modify the Broker code and monitor server-side write performance at the Netty level, but given the risk I held off on code changes for the time being and instead carefully re-read RocketMQ's Netty wrapper code. Before this pass over the source, I had always assumed that RocketMQ's network layer needed essentially no parameter tuning, because our servers all have 64 cores and Netty's IO thread count defaults to twice the number of CPU cores. After reading the source, however, I found that RocketMQ has two IO-related thread parameters:

  • serverSelectorThreads

The default value is 3.

  • serverWorkerThreads

The default value is 8.

serverSelectorThreads corresponds to Netty's worker group, the so-called IO thread pool. Each thread in it holds an NIO Selector for event selection, and channels are registered to these three threads in round-robin fashion. All channels bound to one thread perform their read and write operations serially on that thread: it alone reads their data out of the TCP receive buffers and writes their data into the TCP send buffers.

Our MQ servers have 64 CPU cores or more; using only three threads for this work is obviously a bit "stingy", so this parameter can be adjusted.

As for serverWorkerThreads: in Netty's thread model, event propagation (encoding, decoding and so on) runs by default on the IO thread, i.e. the thread holding the Selector mentioned above. To relieve pressure on the IO threads, RocketMQ defines a separate thread pool for event propagation: the IO thread handles only network reads and writes, while decoding inbound data, encoding outbound data and other handler work run in an independent thread pool whose size is specified by serverWorkerThreads. A rough sketch of this separation follows.
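The following is only a rough Netty-style sketch of this separation, not RocketMQ's actual server code; the handler is a placeholder (Netty's LoggingHandler) and the thread counts simply mirror the defaults mentioned above.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LoggingHandler;
import io.netty.util.concurrent.DefaultEventExecutorGroup;

// Rough sketch of the thread-model separation described above: 3 selector (IO)
// threads, plus an 8-thread executor group that runs the pipeline handlers.
public class NettyServerThreadModelSketch {
    public static void main(String[] args) {
        NioEventLoopGroup bossGroup = new NioEventLoopGroup(1);           // accepts connections
        NioEventLoopGroup selectorGroup = new NioEventLoopGroup(3);       // like serverSelectorThreads: IO read/write
        DefaultEventExecutorGroup workerGroup = new DefaultEventExecutorGroup(8); // like serverWorkerThreads: codec etc.

        ServerBootstrap bootstrap = new ServerBootstrap()
                .group(bossGroup, selectorGroup)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // Handlers registered with workerGroup run off the IO thread,
                        // so the selector threads only do socket reads and writes.
                        ch.pipeline().addLast(workerGroup, new LoggingHandler());
                    }
                });
        // bootstrap.bind(...) omitted; this sketch only shows the thread wiring.
    }
}
```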

Seeing this, I finally got excited and felt I was getting closer and closer to the truth. Taking Netty's default of twice the CPU core count as a reference, my first wave of parameter tuning set serverSelectorThreads=16 and serverWorkerThreads=32, and I verified the change in the production environment. A sketch of how such values can be applied is shown below.
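Assuming the usual RocketMQ mechanism of mapping Broker configuration properties onto NettyServerConfig fields, the tuning could be applied in broker.conf roughly as follows; the property names mirror the field names above, so please verify them against the RocketMQ version you run.

```properties
# broker.conf (excerpt): network thread-model tuning used in the first round
serverSelectorThreads=16
serverWorkerThreads=32
```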

After more than a month of verification, even with fewer machines in the cluster (it was scaled down after Double Eleven), only a very small number of send timeouts occurred, few enough to be ignored.

3. Summary

This article has detailed my troubleshooting of MQ send timeouts, which ultimately traced the problem to unreasonable network-communication thread-model parameters on the RocketMQ server.

Why it took so long is worth reflecting on: I was misled by my own "experience". I used to attribute the timeouts directly to the network, because RocketMQ's PageCache write metrics looked excellent, with essentially no write exceeding 100 ms, so I assumed the RocketMQ server had no performance bottleneck and never considered its network communication layer.

This article is reprinted from the WeChat public account "Middleware Interest Circle". To reprint it, please contact the Middleware Interest Circle public account.
