A timeout problem occurred in an intranet environment, and the network team said the network was not at fault. My understanding of timeouts has deepened steadily with continued practice at work. The RocketMQ send-timeout problem in our production environment troubled me for nearly half a year; I have finally reached a reasonably good result, and I would like to share it with you. This write-up touches on several areas I am not very familiar with, such as networking, so please point out any mistakes promptly so that we can grow and improve together.

1. Network timeout

From time to time I received feedback from project teams that MQ message sending in the production environment was timing out; the relevant client logs are shown in the screenshot below. Today's story begins with that picture.

2. Troubleshooting

2.1 Preliminary analysis

There are two very critical logs in the figure above:
Network call timeout;
A response arriving after the client has already timed out and given up on the request.

The second log is the critical one. It means that although the client timed out while waiting for the server's result, it did eventually receive the server's response; by then, however, the client had waited long enough and had already given up processing it. Let me elaborate on the mechanism behind this second log. Sending multiple requests over a single connection is in fact a common programming pattern: a long connection sends two requests, req1 and req2, to the server in succession. When the client receives a response from the server, how does it determine which request the response belongs to? As shown in the figure below, the client has multiple threads sending req1 and req2 through one connection, while the server also processes them with multiple threads, so the response to req2 may well be received first. How can the client tell which request the data returned by the server corresponds to?

The solution is that, before sending a request, the client generates a request id (requestId) that is unique within that process, uses the Future pattern to put the requestId and a Future object into a map, and then puts the requestId into the request body. When the server returns the response, it copies the request id into the response unchanged. When the client receives the response, it first extracts the requestId, then finds the corresponding Future object in the cache, wakes up the business thread, and hands the returned result to the caller, completing the whole exchange.

So what we can see here is that the client did not receive the server's response within the specified time, but it was able to receive it in the end. The suspicion therefore points straight at the Broker: does the Broker have a bottleneck that makes its processing very slow?
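Before moving on, here is a minimal, hedged sketch of the requestId/Future correlation described above. It is purely illustrative and is not RocketMQ's actual remoting code; the class and method names are made up for the example.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of multiplexing many requests over one connection:
// each request carries a unique requestId, and a Future registered under that
// id is completed when the matching response arrives.
public class RequestCorrelator {
    private final AtomicInteger idGenerator = new AtomicInteger();
    private final Map<Integer, CompletableFuture<byte[]>> pendingRequests = new ConcurrentHashMap<>();

    /** Sending side: register a Future keyed by a fresh requestId, then write the request. */
    public CompletableFuture<byte[]> send(byte[] requestBody) {
        int requestId = idGenerator.incrementAndGet();
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        pendingRequests.put(requestId, future);
        // ... here the request, with requestId embedded in its header, would be written to the channel ...
        return future;
    }

    /** IO thread: a response arrived carrying the requestId echoed back by the server. */
    public void onResponse(int requestId, byte[] responseBody) {
        CompletableFuture<byte[]> future = pendingRequests.remove(requestId);
        if (future != null) {
            future.complete(responseBody);   // wakes up the business thread waiting below
        } else {
            // The caller already timed out and gave up: this is exactly the situation
            // behind the second log at the start of the article. Real implementations
            // also purge timed-out entries from the map.
            System.out.println("late response, no pending request for id " + requestId);
        }
    }

    /** Business thread: wait up to the send timeout for its own response. */
    public byte[] waitForResponse(CompletableFuture<byte[]> future, long timeoutMillis) throws Exception {
        return future.get(timeoutMillis, TimeUnit.MILLISECONDS);  // throws TimeoutException on timeout
    }
}
```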
2.2 Broker-side Processing Bottleneck Analysis

In my "experience", if there is a bottleneck in RocketMQ message sending, the Broker usually returns various Broker Busy errors, and you can determine whether the Broker has hit a bottleneck by tracking the indicators for data written to the PageCache on the Broker side. The result looks similar to the screenshot below.

Tips: the picture above is a screenshot from my local machine. When I analyzed the problem, the time it took each Broker in the MQ cluster to write this data to the PageCache never exceeded 100ms. Precisely because the PageCache write figures looked so good, I proposed shifting the focus to the network, based on the rough network interaction shown in the figure below. I also confirmed with the business side that although the send calls returned timeouts, the messages were persisted in MQ and consumers could consume them normally. Although in theory a LAN should not have problems, the network group still started network troubleshooting in view of the above phenomena.

Warm reminder: in the end it turned out that I was being led astray here.

2.3 Network Analysis

There are usually two methods for network analysis: netstat and packet capture.

2.3.1 View Recv-Q and Send-Q with netstat

With netstat we can focus on two indicators: Recv-Q and Send-Q.
Recv-Q: the TCP receive buffer, i.e. data the kernel has received on the connection but the application has not yet read;
Send-Q: the TCP send buffer, i.e. data the application has written but the peer has not yet acknowledged.

The roles of Recv-Q and Send-Q in TCP are illustrated in the figure below.
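The next step of the troubleshooting samples these two columns repeatedly. As a rough, hedged illustration of how such sampling can be scripted (assuming a Linux machine with netstat installed; the parsing is simplistic and this is not the tool the network group actually used):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative only: run `netstat -ant` every 500 ms and print
// connections whose Recv-Q or Send-Q column is non-zero.
public class NetstatSampler {
    public static void main(String[] args) throws Exception {
        while (true) {
            Process p = new ProcessBuilder("netstat", "-ant").start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // Expected data rows: Proto Recv-Q Send-Q Local-Address Foreign-Address State
                    String[] cols = line.trim().split("\\s+");
                    if (cols.length >= 6 && cols[0].startsWith("tcp")) {
                        long recvQ = Long.parseLong(cols[1]);
                        long sendQ = Long.parseLong(cols[2]);
                        if (recvQ > 0 || sendQ > 0) {
                            System.out.printf("%d Recv-Q=%d Send-Q=%d %s -> %s%n",
                                    System.currentTimeMillis(), recvQ, sendQ, cols[3], cols[4]);
                        }
                    }
                }
            }
            p.waitFor();
            Thread.sleep(500); // the colleagues in this article sampled every 500 ms
        }
    }
}
```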
Network colleagues collected netstat data on both the client and the MQ server every 500ms. Summarizing the collected results produced the figure below: from the client's perspective, the client's Recv-Q occasionally showed a large backlog, and this corresponded to a large backlog in the MQ side's Send-Q. From the communication model above we could therefore infer, once again, that the client might be reading bytes from the network too slowly. Since the client is a virtual machine, the netstat results made a client-side problem look likely (note: in the end it turned out not to be a client problem, so don't be led astray).

2.3.2 Network packet capture

To be more rigorous, colleagues in the network group also captured the following packets. One thing looked very suspicious: the sliding window between the client and the server was only 190 bytes, while the response packet of an MQ send is about 250 bytes, so a response would need to be transmitted twice before it could be fully received. At first I thought this was the main cause, but comparisons showed that clients whose sliding window was not smaller than 250 bytes also hit timeouts, so I judged that this was not the main reason either. My network colleagues then used various tools and concluded that there was no problem with the network and that the problem lay in the application. Come to think of it, it is a LAN after all, so next we focused on the client's read performance, following the netstat results.

2.4 Client Network Read Performance Bottleneck Analysis

To verify the client's read and write performance, I modified the RocketMQ client packages and added Netty performance collection code; a screenshot of that code is shown below (a hedged sketch of the same idea appears at the end of this subsection). My main idea was to determine how many read operations a channel performs each time a read event is triggered. If a single read event requires many reads, it means a lot of data really is piling up in the channel and there is a bottleneck in network reading. But, disappointingly, there was no bottleneck in the client's network reads. Part of the collected data is shown below. Analyzing it with the awk command showed that after a read event is triggered, most channels can drain the data in the read buffer within two reads, so there is no bottleneck on the read side. The awk statistics are shown in the figure below.
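The screenshot of that instrumentation is not reproduced here. As a rough, hedged sketch of the same counting idea (not the author's actual patch to the RocketMQ client): placed at the very front of the pipeline, before any decoder, the handler below counts how many channelRead calls occur before each channelReadComplete, i.e. roughly how many reads one read event needed.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.AttributeKey;

// Illustrative sketch: count reads per read event on each channel.
// channelRead() fires once per read from the socket buffer and
// channelReadComplete() fires when the current read loop finishes, so the
// counter value at channelReadComplete() approximates "reads per read event".
public class ReadCountHandler extends ChannelInboundHandlerAdapter {
    private static final AttributeKey<Integer> READS = AttributeKey.valueOf("readsPerEvent");

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        Integer n = ctx.channel().attr(READS).get();
        ctx.channel().attr(READS).set(n == null ? 1 : n + 1);
        super.channelRead(ctx, msg);
    }

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx) throws Exception {
        Integer n = ctx.channel().attr(READS).getAndSet(0);
        if (n != null && n > 0) {
            // One line per read event; lines like these can later be aggregated with awk.
            System.out.printf("%s reads-per-event=%d%n", ctx.channel().remoteAddress(), n);
        }
        super.channelReadComplete(ctx);
    }
}
```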
So the finger pointed at the Broker again: could it be that writing to the cache is slower than it looks?

2.5 Network bottleneck on the broker side

According to the analysis above, the Broker writes to the PageCache very quickly, but the dimension of writing the response results back to the network was not monitored at all. Could it be that the responses were not being written out in time, piling up in Netty's write buffer and failing to reach the TCP send buffer promptly, thereby causing the timeouts? I originally wanted to modify the code and monitor the server's write performance at the Netty level, but considering the risk, I did not change the code for the time being and instead carefully re-read the code with which RocketMQ wraps Netty. Before this reading of the source code, I had always thought that RocketMQ's network layer basically needed no parameter tuning, because our servers all have 64 cores and Netty's IO thread count defaults to the number of CPU cores.

However, after reading the source code, I found that RocketMQ has two IO-related thread parameters:

serverSelectorThreads, whose default value is 3;
serverWorkerThreads, whose default value is 8.
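Before walking through what each parameter controls, here is a minimal, hedged sketch of how an IO (selector) group and a separate worker group for encoding/decoding are typically wired up in plain Netty. It is illustrative only, not RocketMQ's actual NettyRemotingServer code; the LoggingHandler and port 9876 are placeholders.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LoggingHandler;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

// Illustrative sketch of the two thread pools discussed in the text.
public class ThreadModelSketch {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);
        // IO ("selector") threads: each holds a Selector; channels are bound to them
        // in turn and all read/write for a channel happens on its thread.
        EventLoopGroup ioGroup = new NioEventLoopGroup(3);                 // analogous to serverSelectorThreads = 3
        // Separate pool for encode/decode and other handler work, keeping IO threads lean.
        EventExecutorGroup workerGroup = new DefaultEventExecutorGroup(8); // analogous to serverWorkerThreads = 8

        ServerBootstrap bootstrap = new ServerBootstrap()
                .group(bossGroup, ioGroup)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // Handlers registered with an EventExecutorGroup run on that group
                        // instead of the channel's IO thread.
                        ch.pipeline().addLast(workerGroup, new LoggingHandler());
                    }
                });
        bootstrap.bind(9876).sync().channel().closeFuture().sync();
    }
}
```

In a RocketMQ deployment these two values are normally adjusted through the broker's configuration file rather than in code.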
serverSelectorThreads corresponds to Netty's worker group, the so-called IO thread pool. Each of these threads holds an NIO Selector object for event selection, and all channels are registered to these three threads in turn. All channels bound to one thread perform their read and write operations serially on that thread; that is, on that single thread they read data from the TCP receive buffer and write data to the send buffer. Our MQ servers are configured with 64 or more CPU cores, so using only three threads for this work is obviously a bit "stingy", and this parameter can be increased.

serverWorkerThreads: in Netty's thread model, event propagation (encoding and decoding) runs by default in the IO thread, that is, in the thread holding the Selector object mentioned above. To reduce the pressure on the IO threads, RocketMQ defines a separate thread pool for event propagation, so that the IO threads are only responsible for network reads and writes, while decoding the bytes that have been read, encoding the data to be written, and other such operations are placed in an independent thread pool whose size is specified by serverWorkerThreads.

Reading this, I finally got excited and felt I was getting closer and closer to the truth. Taking Netty's practice of sizing IO threads relative to the number of CPU cores as a reference, my first round of parameter tuning set serverSelectorThreads=16 and serverWorkerThreads=32, and I verified it in the production environment. After more than a month of verification, with the number of machines in the cluster reduced (scaled down after Double Eleven), only a very small number of send timeouts occurred, which can basically be ignored.

3. Summary

This article detailed my efforts to troubleshoot MQ send timeouts, which were ultimately traced to unreasonable settings of the network communication thread model parameters on the RocketMQ server. Why it took so long is worth reflecting on: I was misled by my "experience". I had habitually attributed the timeouts directly to the network, because RocketMQ's PageCache write figures were very good, with essentially no write request exceeding 100ms, so I assumed the RocketMQ server had no performance bottleneck and never considered the entire network communication layer.

This article is reprinted from the WeChat public account "Middleware Interest Circle". To reprint it, please contact the Middleware Interest Circle public account.