Daily Bug Troubleshooting-All Connections Suddenly Closed

Preface

The daily bug troubleshooting series covers the diagnosis of relatively simple bugs. In it I share small troubleshooting tricks and accumulate case material along the way.

Bug scene

I recently ran into a problem where, after the number of connections on a machine reached a certain level (about 45,000), it suddenly collapsed to a few hundred. On the application side this showed up as a flood of connection errors, and the system stopped responding.

Ideas

Idea 1: The first suspect is always the code. A quick look showed that the connections are managed by a mature framework rather than hand-rolled connection handling, so the code itself was unlikely to be the culprit.
Idea 2: Next I suspected a kernel limit, such as the file descriptor limit. But there is a contradiction: if the kernel simply capped the number of connections, the count should plateau at that cap rather than drop sharply.
Idea 2.1: Going further, it seemed very likely that some indirect resource had hit a bottleneck, so that once it was exhausted no connection could obtain it and everything started failing. The resources a TCP connection consumes are essentially CPU, memory, and bandwidth.

Monitoring information

With these ideas in mind, let's look at the relevant monitoring data.
CPU monitoring: CPU consumption is high, close to 70%, but CPU starvation generally just makes responses slower, which does not match the symptom.
Bandwidth monitoring: bandwidth utilization is around 50%, not particularly high.
Memory monitoring: a large amount of memory is indeed in use, with RSS reaching 26G, but against 128G of total memory this is clearly not a bottleneck.
So none of these three resources has hit an obvious system-level limit. Still, I suspected from the start that memory usage had tripped some more specific limit, because only a failed memory allocation would make a TCP connection error out and drop directly.

TCP monitoring information

When conventional monitoring is no longer enough to analyze the problem, I reach straight for the most effective statistics command for TCP issues:

 # This command prints detailed statistics on TCP connections;
 # many problems can be diagnosed from clues in its output.
 netstat -s
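
netstat -s prints a long list of counters, so it can help to filter for the memory-related ones first. A minimal sketch (the exact counter wording differs between kernel and net-tools versions):

 # Show only memory/pressure-related counters from the statistics dump
 # (counter wording varies across kernel and net-tools versions)
 netstat -s | grep -i -E 'memory|prune|collapse'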

Looking carefully at the TCP and TCP-memory related items in this output, I found something very unusual:

 ...
 TcpExt:
     TCP ran low on memory 19 times
 ...

This is exactly the memory limit I had guessed at: TCP memory is insufficient, so memory allocations fail while reading or writing data, and the TCP connection itself is then dropped.

Modify kernel parameters

Because I have read the Linux TCP source code and its tunable kernel parameters in detail, I already had a rough idea of where the TCP memory limit lives. With GPT, knowing the general direction is enough: I asked it directly and it pointed me to the parameter, tcp_mem.

 cat /proc/sys/net/ipv4/tcp_mem
 1570347 2097152 3144050

These three values are the thresholds at which TCP switches between different memory-usage strategies. The unit is pages of 4KB each; ask GPT for a detailed explanation if you like, I won't repeat it here. The key point is that once the total memory consumed by TCP exceeds the third value, 3144050 pages (about 12G, or 9.35% of the 128G of memory; the page-to-bytes conversion is sketched below), TCP starts dropping connections because it can no longer allocate memory. And this application really does consume a lot of memory per TCP connection, since a single request can run to several MB.
Once memory consumption crosses that limit, the kernel forcibly drops TCP connections. This explains why almost all connections dropped within a short period: connections keep allocating memory, and once the hard threshold is reached every allocation fails, errors cascade, all connections in the system are closed, and the system stops responding.
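
To sanity-check the page arithmetic, the hard threshold can be converted to bytes using the system page size. A small sketch; getconf and awk are standard tools, and the page size is typically 4KB on x86-64:

 # Convert the third (hard) tcp_mem threshold from pages to MiB
 PAGE=$(getconf PAGESIZE)                               # typically 4096 bytes
 HARD=$(awk '{print $3}' /proc/sys/net/ipv4/tcp_mem)    # hard limit, in pages
 echo "hard limit: $HARD pages = $(( HARD * PAGE / 1024 / 1024 )) MiB"   # ~12G in this case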

Once the cause is known, the fix is simple: just increase tcp_mem:

 cat /proc/sys/net/ipv4/tcp_mem
 3570347 6097152 9144050
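
One common way to apply and persist this kind of change is via sysctl. A sketch, assuming root privileges; the values are simply the ones shown above and should be sized to your own memory and workload:

 # Apply the new thresholds at runtime
 sysctl -w net.ipv4.tcp_mem="3570347 6097152 9144050"
 # Persist them across reboots
 echo 'net.ipv4.tcp_mem = 3570347 6097152 9144050' >> /etc/sysctl.conf
 sysctl -p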

The system remains stable after adjustment

After the kernel parameter was adjusted, the number of connections climbed past 50,000 and stayed stable. At that point I checked the TCP memory (page) consumption again:

 cat /proc/net/sockstat
 TCP: inuse xxx orphan xxx tw xxx alloc xxxx mem 4322151

This output shows that, with the system running smoothly, TCP normally uses 4322151 pages of memory (the mem field), which is well above the previous hard limit of 3144050. That further confirms the diagnosis.
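
To keep an eye on this after the change, the mem field in /proc/net/sockstat can be compared periodically against the configured thresholds. A minimal sketch:

 # Print TCP page usage alongside the configured tcp_mem thresholds every 5 seconds
 watch -n 5 'grep "^TCP:" /proc/net/sockstat; cat /proc/sys/net/ipv4/tcp_mem'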

Corresponding kernel stack

For reference, here is the corresponding Linux kernel call path:

 tcp_v4_do_rcv
   |->tcp_rcv_established
     |->tcp_data_queue
       |->tcp_try_rmem_schedule
         |->sk_rmem_schedule
           |->__sk_mem_raise_allocated
             |-> /* Over hard limit. */
                 if (allocated > sk_prot_mem_limits(sk, 2))
                     goto suppress_allocation;
       |->goto drop:
             tcp_drop(sk, skb)

As this path shows, once the allocated memory exceeds the relevant limit, the Linux kernel directly drops the TCP connection.
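
To confirm from user space that this code path is being hit, the related SNMP counters can be read directly. A sketch using nstat from iproute2; counter names such as TcpExtTCPMemoryPressures may differ slightly between kernel versions:

 # Dump absolute counter values and pick out the memory-pressure related ones
 nstat -az | grep -i -E 'memorypressure|prune'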

Summary

Once the bug scene was clear, it took me about 20 minutes to pin the problem on the TCP memory bottleneck, and with GPT's help the fix was found very quickly. GPT really does speed up this kind of search; I feel it can replace a search engine to a large extent. But the prompts you feed it still have to be built from the bug scene and a certain amount of experience. It cannot replace your thinking, but it can greatly accelerate information retrieval.
