Preface

The "daily bug troubleshooting" series is about working through simple bugs. In it, the author introduces small tricks for troubleshooting while accumulating material for later reference.

Bug scene

I recently ran into a problem where a machine, after reaching a certain number of connections (about 45,000), suddenly dropped to only a few hundred connections. On the application side this showed up as a large number of connection errors and a system that stopped responding, as shown in the following figure (screenshot omitted here).

Ideas

Idea 1: the first suspicion is always a bug in the code. A quick look showed that connections are handled by a mature framework rather than by hand-rolled connection code, so a bug in the application code seemed unlikely.

Monitoring information

With that idea in mind, the next step was to look at the relevant monitoring data.

CPU monitoring: CPU usage was quite high, close to 70%, but CPU starvation generally only makes responses slower, which does not match the symptom.

Bandwidth monitoring: bandwidth utilization was around 50%, which is not high.

Memory monitoring: a large amount of memory was indeed in use, with an RSS of 26 GB, but against 128 GB of total memory that is clearly unlikely to be a bottleneck.

So after looking at these three metrics, none of the system's resources had reached a hard limit. Even so, I suspected from the beginning that memory usage might have hit some special limit, because a TCP connection is only likely to error out and be dropped outright when a memory allocation fails.

TCP monitoring information

When ordinary monitoring is no longer enough to analyze the problem, I reach for the most effective statistics for TCP issues (the command and its output appear only as screenshots in the original post; a hedged reconstruction of this kind of check is sketched below, after the next section). Looking carefully at the TCP and TCP-memory related items in that output, I found something very unusual: the TCP memory counters had hit their limit. This was exactly the memory limit I had guessed at. TCP memory was exhausted, so memory allocations failed while reading or writing data, and the TCP connections themselves were then dropped.

Modify kernel parameters

Because I have read the Linux TCP source code and its tunable kernel parameters in detail, I already had the TCP memory limit in mind. With GPT, knowing the general direction is enough: asking it directly pointed to the parameter tcp_mem. Its three values set the thresholds at which TCP switches memory usage strategies, and the unit is pages, i.e. 4 KB. You can ask GPT for a detailed explanation, so I won't go into the details here. The core point is that when the total memory consumed by TCP exceeds the third value, here 3144050 pages (about 12 GB, or 9.35% of the 128 GB of memory), TCP starts dropping connections because it can no longer allocate memory. The application really does consume a lot of memory per TCP connection, since a single request can be several MB in size. Once the cause is known, the fix is simple: just increase tcp_mem, as sketched below.
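The original post shows the diagnostic output only as screenshots and does not name the exact command, so the following is an assumption: these are the usual ways to inspect TCP memory accounting on Linux, and the comments describe the output format rather than the author's actual numbers.

```bash
# TCP memory accounting, in pages, is the "mem" field of the TCP line:
#   TCP: inuse <n> orphan <n> tw <n> alloc <n> mem <pages>
cat /proc/net/sockstat

# Socket summary by state.
ss -s

# Protocol counters; memory pressure tends to show up in prune/collapse lines.
netstat -s | grep -iE 'memory|prun|collaps'

# Current tcp_mem thresholds (low / pressure / high), also in 4 KB pages;
# in the article the third value was 3144050 (about 12 GB).
cat /proc/sys/net/ipv4/tcp_mem
```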
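Below is a hedged sketch of the kind of adjustment described above. The new thresholds are illustrative assumptions, not the values the author actually used; they should be chosen to leave comfortable headroom above the steady-state TCP memory usage of your own workload.

```bash
# Run as root. Inspect the current thresholds (low / pressure / high).
sysctl net.ipv4.tcp_mem

# Raise the limits at runtime. Illustrative values: 16 GB / 20 GB / 24 GB,
# expressed in 4 KB pages.
sysctl -w net.ipv4.tcp_mem="4194304 5242880 6291456"

# Persist the change across reboots.
echo 'net.ipv4.tcp_mem = 4194304 5242880 6291456' >> /etc/sysctl.conf
sysctl -p
```

As the next section shows, steady-state usage after the fix was 4322151 pages (roughly 16.5 GB), already above the old limit of 3144050 pages, so the new ceiling needs to sit well above the observed usage.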
The system remains stable after adjustment

After the kernel parameter was adjusted, the number of connections on the system exceeded 50,000 and remained stable. At this point we looked again at the page that reports TCP memory consumption: with the system running smoothly, the mem field showed 4322151 pages in normal use, far more than the previous limit of 3144050. This further confirms the author's judgment.

Corresponding kernel stack

The corresponding Linux kernel stack is recorded here (the screenshot is omitted in this copy; a hedged way to observe the behavior on a live system is sketched at the end of the post). It shows that when the memory to be allocated exceeds the configured limit, the Linux kernel drops the TCP connection directly.

Summary

Once the bug scene was clearly understood, it took about 20 minutes to pin the problem down to the TCP memory bottleneck, and the corresponding solution was then found very quickly with the help of GPT. It has to be said that GPT greatly speeds up the search process; personally, I feel it can replace search engines to a large extent. However, the prompts fed to GPT still have to be built from the bug scene and a certain amount of experience. It cannot replace your thinking, but it can greatly speed up the retrieval of information.
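As a closing sketch, the kernel-side behavior described in the "Corresponding kernel stack" section can usually be confirmed from user space. Both the log message and the tcp_check_oom symbol below exist in mainline kernels, but wording, inlining, and symbol availability vary by version, so treat this as an assumption to verify on your own machine rather than a definitive recipe.

```bash
# When the tcp_mem hard limit is exceeded, many kernels log this hint.
dmesg | grep -i 'out of memory -- consider tuning tcp_mem'

# If bpftrace is installed, the stack leading to the memory-pressure check
# can be captured by probing tcp_check_oom (assuming the function is not
# inlined in your kernel build).
bpftrace -e 'kprobe:tcp_check_oom { printf("%s\n", kstack); }'
```

On mainline kernels this check is reached from the socket close path and from the TCP timer paths; the exact stack depends on the kernel version.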