1. Introduction to the production failure

Business background: this service mainly exposes proxy interfaces to external callers. Most of the interfaces call third-party interfaces, aggregate the returned data, and then serve it to the client.

One evening, while the system was at its traffic peak, the project team was enjoying its "overtime meal". Just as the food reached our mouths, email and SMS alerts arrived at the same time: a server interface was timing out. I receive similar alerts occasionally, sometimes caused by network jitter, and I am genuinely sorry for all the times the "network" colleagues took the blame for nothing :). This time, however, the alerts did not converge and kept firing for more than ten minutes, which was a bad sign. When I clicked the interface URL in the email, the page just kept spinning and responded very slowly. What a tragedy! At that moment I quietly pushed the lunch box aside, no longer able to look at it :(

Basic steps of locating the problem:

(1) Determine the scope of impact. Multiple servers sit behind this service and only one of them had failed, so the impact on users was limited. We temporarily removed it from the registry so that clients would stop retrying against it, which also preserved the accident scene.

(2) Check the monitoring metrics. Because it was the evening peak, the interface traffic was higher than in other periods, but it showed no obvious spike compared with the same period on previous days. The server's CPU, memory, IO, and network metrics on the monitoring dashboards also looked normal.

(3) Troubleshoot on the server. We logged in to the server and, together with the monitoring data, checked CPU, memory, and the other metrics. The service logs were all normal.
There was no unusual log output, no Exception, no OOM, nothing abnormal. What we did see was that the interface service could no longer respond normally. The application runs on the JVM, so we quickly investigated with the standard commands that ship with the JDK. The following command prints the thread stack information:
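The command block itself did not survive extraction. A typical workflow (an assumption; standard JDK tooling, not necessarily the author's exact commands) is `jps -l` to find the PID and `jstack -l <pid> > jstack.log` to dump the stacks. The per-state tally is demonstrated below on canned jstack-style lines so it runs anywhere:

```shell
# On the server (typical usage, assumed):
#   jps -l
#   jstack -l <pid> > jstack.log
# Tally threads by state; shown here on canned jstack-style lines:
printf '%s\n' \
  '   java.lang.Thread.State: RUNNABLE' \
  '   java.lang.Thread.State: WAITING (parking)' \
  '   java.lang.Thread.State: WAITING (on object monitor)' \
  '   java.lang.Thread.State: WAITING (parking)' |
  grep -o 'java.lang.Thread.State: .*' | sort | uniq -c | sort -rn
```

Replacing the `printf` with `cat jstack.log` gives the same per-state counts for a real dump.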
The statistics look as follows. Thread states such as java.lang.Thread.State: WAITING (parking) and java.lang.Thread.State: WAITING (on object monitor) deserve attention, because they are usually caused by problems in the program itself. Following the WAITING entries, we examined the stack information in jstack.log and found a large number of threads suspended while waiting inside HttpClient utility-class requests; the exact stack frames are analyzed in detail below. These service calls go through our HttpClient utility, which wraps Spring's RestTemplate and, underneath, delegates to the Apache HttpClient library. Besides the jstack anomaly, we also checked the network status on the server, a routine troubleshooting step for the operations team. The command to count network connection states:
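The command was lost in extraction. A common one-liner for this (an assumption, not necessarily the author's exact command) pipes netstat through awk; the awk tally is demonstrated on canned netstat-style lines so it runs anywhere:

```shell
# On the server (assumed form):
#   netstat -ant | awk '/^tcp/ {++s[$NF]} END {for (k in s) print k, s[k]}'
# Demonstration of the awk tally on canned netstat-style lines:
printf '%s\n' \
  'tcp 0 0 10.0.0.1:8080 10.0.0.2:51000 ESTABLISHED' \
  'tcp 0 0 10.0.0.1:8080 10.0.0.3:51001 CLOSE_WAIT' \
  'tcp 0 0 10.0.0.1:8080 10.0.0.4:51002 CLOSE_WAIT' |
  awk '/^tcp/ {++s[$NF]} END {for (k in s) print k, s[k]}' | sort
```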
Statistical results:
Note the large number of connections in the CLOSE_WAIT state in the server's network connection statistics. They were still there when the command was run again after a while; in other words, they were never released. The problem looked serious. A reasonable inference was that these CLOSE_WAIT connections were related both to the slow interface responses and to the HttpClient thread blocking seen in the Java stacks, so we took them as the entry point for the analysis.

First, what CLOSE_WAIT means. It is a state in the TCP disconnection sequence: when the client initiates the disconnection, the server receives the first disconnect request and replies with an acknowledgement; from that point on, the server side is in CLOSE_WAIT. Once the server finishes its own processing, it sends its own close packet to the client and the connection closes normally.

2. TCP three-way handshake process

Although CLOSE_WAIT belongs to the four-way close sequence, it helps to review the three-way handshake first, because establishing a TCP connection is the first thing the requesting side must do. Technology comes from life, so let us use a daily-life example to understand the three-way handshake. Suppose you want to chat on WeChat with a friend you have not seen for a long time:
If every conversation with a friend started like this, wouldn't it be exhausting? Yet the exchange above maps nicely onto the TCP three-way handshake. Treat Xiaodong as the "client" and Xiaosheng as the "server": Xiaodong is a programmer working in IT, and Xiaosheng is running a business back in his hometown. The handshake maps as follows:
The states mentioned above, SYN_SENT, SYN_RCVD, and ESTABLISHED, are the key states involved in the three-way handshake. Look back at the earlier diagram for an intuitive view.

3. TCP four-way close process

Xiaodong and Xiaosheng's lively chat came to an end. It was already late, they had both been busy all day, and they needed to rest. Xiaodong has to get up early for work tomorrow, so he tells Xiaosheng in advance:
The corresponding mapping to the TCP four-way close process:
Why does Xiaodong have to wait in the TIME_WAIT state? Following the "old rules", he must make sure Xiaosheng actually received the final message before the chat can truly end. The standard TIME_WAIT duration is 4 minutes. During this period, the socket resources (ports) of the TCP connection between Xiaodong and Xiaosheng cannot be used by anyone else, nor can they be reclaimed and reused by the system. If Xiaosheng does not receive the final acknowledgement, he keeps retransmitting his FIN until Xiaodong's ACK arrives successfully. Only then do both sides officially close the chat channel, release the port resources, and close the connection. The 4-minute wait is 2 MSLs, each MSL being 2 minutes. MSL stands for maximum segment lifetime, the longest time a TCP segment may live in the network, and the value is specified by the RFC. The earlier diagram gives a further intuitive view.

4. Analyzing the source code with the Java stack

Back to the four-way close: when the server receives the TCP disconnect request packet, it replies with an acknowledgement to the client, and at that point the server side is in CLOSE_WAIT. Now we know where this state sits in the connection lifecycle. Combine that with the symptoms above: a large number of threads blocked in HttpClient calls in the WAITING state, and a large number of CLOSE_WAIT connections on the server that are never released. Threads are a relatively precious resource in a JVM process; if many of them stay waiting or blocked, all the threads are gradually occupied and the service can no longer respond. Let us find the specific cause from the Java thread stacks and the source code. The first key stack log:
The stack log contains a large number of entries like the one above, basically all from the HttpClient utility class, with every thread in the java.lang.Thread.State: WAITING (parking) state. These threads are parked because the interface service issues many third-party HTTP calls, cannot obtain an HTTP connection, and can only wait. Our HttpClientUtil class extends Spring's RestTemplate (package org.springframework.web.client.RestTemplate) and adds some parameters, a retry mechanism, and proxy customization. The class diagram is shown below. Sample code that creates the HttpClient utility:
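The sample code itself did not survive extraction. A minimal reconstruction (class name, pool sizes, and wiring are assumptions, not the project's actual code) of a RestTemplate backed by a pooled Apache HttpClient 4.x could look like this:

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

// Hypothetical sketch of an HttpClientUtil-style setup (values illustrative).
public class HttpClientUtil extends RestTemplate {
    public HttpClientUtil() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);            // pool-wide connection cap
        cm.setDefaultMaxPerRoute(50);   // per-route connection cap
        CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build();
        setRequestFactory(new HttpComponentsClientHttpRequestFactory(client));
    }
}
```

A real implementation would also add the retry mechanism and proxy customization mentioned above.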
HttpClientFactoryBean extends AbstractFactoryBean and overrides the getObjectType() and createInstance() methods. The class diagram is shown below. Some example methods of HttpClientFactoryBean:
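The example methods were lost in extraction. A hedged sketch of what such a factory bean looks like against Spring's AbstractFactoryBean API (method bodies are assumptions):

```java
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.HttpClients;
import org.springframework.beans.factory.config.AbstractFactoryBean;

// Hypothetical reconstruction of the factory bean described above.
public class HttpClientFactoryBean extends AbstractFactoryBean<HttpClient> {
    @Override
    public Class<?> getObjectType() {
        return HttpClient.class;
    }

    @Override
    protected HttpClient createInstance() {
        // The real code would apply pool, timeout, retry, and proxy settings here.
        return HttpClients.custom().build();
    }
}
```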
The stack information also shows that the connection is obtained through the leaseConnection() method of PoolingHttpClientConnectionManager, so let us read the source in detail to see why the acquisition never succeeds. How do we locate the source? The call chain in the stack tells us exactly which classes and methods are involved, down to the line of code. According to the jstack log above:
As the name suggests, the abstract connection pool class AbstractConnPool calls its blocking method getPoolEntryBlocking() to obtain a connection, at line 380 of the code. The source:
At line 380 we find the call to the Condition's await() method:
Here, the Condition from the java.util.concurrent package implements coordination between threads. After await() is called, the current thread is added to the Condition's waiting queue, a FIFO queue, and the lock held by the current thread is released. It must be released here; otherwise no other thread could acquire the lock, which could lead to deadlock. The await() method source:
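To see this mechanism without Apache's full source, here is a minimal JDK-only sketch (TinyPool is an invented name, not HttpClient code) of the same pattern: a guard loop that waits on a Condition, with or without a deadline, and a release() that signals waiters:

```java
import java.util.ArrayDeque;
import java.util.Date;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch (not Apache's code) of the blocking-acquire pattern:
// wait on a Condition while the pool is empty; release() signals waiters.
class TinyPool<T> {
    private final Queue<T> free = new ArrayDeque<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition available = lock.newCondition();

    T lease(Date deadline) throws InterruptedException {
        lock.lock();
        try {
            while (free.isEmpty()) {
                if (deadline == null) {
                    available.await();               // may wait forever
                } else if (!available.awaitUntil(deadline)) {
                    // Deadline passed without a signal: fail instead of hanging.
                    throw new IllegalStateException("Timeout waiting for connection");
                }
            }
            return free.poll();
        } finally {
            lock.unlock();
        }
    }

    void release(T entry) {
        lock.lock();
        try {
            free.add(entry);
            available.signalAll();                   // wake all waiters; one wins the lock
        } finally {
            lock.unlock();
        }
    }
}
```

With deadline == null this sketch reproduces the failure mode in the incident: a thread that is never signalled parks forever.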
The diagram shows the structure after the current thread joins the Condition waiting queue. When signalAll() or signal() is called on the Condition, the first node of the waiting queue is taken, moved to the synchronization queue, and the thread in that node is woken via LockSupport. The move from the waiting queue to the AQS synchronization queue is shown in the following structure diagram. In the AbstractConnPool class, we find the code that releases a connection. The release() method source:
We can see that releasing a connection calls this.condition.signalAll(), which wakes up every thread in the waiting queue. Although all of them are woken, only one can acquire the lock; the others must go back to waiting. The signalAll() method source:
With the underlying code analyzed, we can step back and see the picture: the parameterless await() of the Condition is called, the HTTP connection can never be obtained, and the Tomcat worker threads stay occupied. At the beginning of the getPoolEntryBlocking() method there is a piece of code that must not be overlooked:
At first glance this code computes a timeout; presumably it is the maximum time to wait when borrowing a connection from the pool. See the getPoolEntryBlocking() method below:
If deadline is not null, the Condition's awaitUntil(deadline) method is called. awaitUntil(deadline) means that if the deadline passes without the thread being signalled, the thread wakes up by itself and joins the AQS synchronization queue to acquire the lock. Following the stack upward, we can find the caller and see where the deadline's timeout value comes from.
MainClientExec#execute() method source code:
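The execute() snippet did not survive extraction. In Apache HttpClient 4.x this is where the lease timeout is read from the request configuration; for reference, a sketch of the client-side configuration that supplies it (a hedged example, values illustrative rather than the project's actual settings):

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ClientConfigSketch {
    public static CloseableHttpClient build() {
        // connectionRequestTimeout bounds the wait for a pooled connection;
        // leaving it unset lets waiting threads park indefinitely.
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(2000)            // TCP connect timeout (ms)
                .setSocketTimeout(5000)             // read timeout (ms)
                .setConnectionRequestTimeout(500)   // pool borrow timeout (ms)
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
    }
}
```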
The timeout here, connectionRequestTimeout, is the value used to compute the deadline, which confirms our guess. In our HttpClient utility's initial configuration, the connectionRequestTimeout parameter was never set. This parameter is critical: if it is absent and a parked thread is never signalled, the thread waits forever, so it must be set. The deadline is an absolute time; when it is non-null, the Condition's awaitUntil(deadline) method is called, and even without a signal the thread wakes up at the deadline and competes for the lock instead of blocking unawakened. The deadline design in awaitUntil(deadline) is similar to that in awaitNanos(long nanosTimeout): when the configured timeout expires without a signal, the success flag is false, the loop breaks, and a TimeoutException("Timeout waiting for connection") is thrown. When this exception is thrown, the system error log states clearly that a connection could not be obtained, and the thread is no longer occupied forever.

5. Finding the "culprit" in the stack

In the previous section we analyzed the Condition internals starting from the first stack log. But that is not the end of the story: we have only covered the WAITING (parking) entries in the java.lang.Thread.State statistics, and the root cause may not lie there. Next we analyze the other "abnormal" thread state, WAITING (on object monitor). The second key log in the Java stack is as follows:
java.lang.Thread.State: WAITING (on object monitor). This thread state also deserves special attention: the thread is waiting on an object monitor and is blocked without ever being released. Based on the thread stack we first suspected the HttpClient parameter settings, so let us analyze the creation parameters. At the top of the stack, the Object wait() method is called, which means the thread is waiting for another thread's notify, but the notification never arrives and the thread stays in WAITING. Let us keep looking for the caller:
This is the calling code; let us look at its source:
We find the lookupTable object, guarded by a synchronized block, and lookupTable.wait() is called inside it; no notification ever arrives, so the method blocks. Reviewing our own code for this turned up nothing, because the problem has little to do with the application itself: it is a JVM thread deadlock caused by IPv6. The Zimbra wiki describes it: https://wiki.zimbra.com/wiki/Configuring_for_IPv4 The explanation of the cause, in short: the application runs in an IPv4 environment, and attempting to use IPv6 there triggers some known problems. When the Inet6AddressImpl.lookupAllHostAddr() method is called, a bug between Java and the operating system's libc library means that under a specific race condition the host address lookup loops forever. This happens very rarely, but when it does, the JVM deadlocks and all threads in the JVM end up blocked. Based on this analysis, the third key stack log found in the jstack output is:
How can we tell whether IPv6 is enabled in the operating system? Two methods: (1) ifconfig. This one is obvious: the words inet6 addr in the output indicate that IPv6 is enabled. (2) lsmod:
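The command block was lost in extraction; the check referred to is presumably of this form:

```shell
# Check whether the ipv6 kernel module is loaded; the "Used" count is the
# third column of lsmod output. "|| true" keeps the exit status clean on
# hosts where the module is absent.
lsmod | grep -w ipv6 || true
```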
Look mainly at the Used column: here its value is above 70, whereas in an environment without IPv6 support it is 1 (the exact value may differ between server environments).

6. Summary of Problem Optimization Solutions

Having identified the causes from the key thread states in the Java stack, here are the fixes. First problem: when a connection cannot be obtained from the HTTP connection pool, threads can block indefinitely. We added connectionRequestTimeout, the connection-borrow timeout, to the HttpClient client's initialization parameters; it is generally not recommended to set it too large, and we set it to 500 ms. Once set, the underlying condition.awaitUntil(deadline) method is called, so a thread that is never signalled wakes up automatically when the deadline passes, leaves the waiting queue, and joins the AQS synchronization queue to compete for the lock. Second problem: the JVM process deadlock caused by IPv6 has two fixes. (1) Disable IPv6 at the operating-system level. Edit the /etc/sysctl.conf file and add the following two lines:
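The two lines themselves were lost in extraction; the standard sysctl keys for this (an assumption about what the author used, but they are the usual ones) are:

```ini
# Append to /etc/sysctl.conf to disable IPv6 on all interfaces:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
```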
Save the file and run sysctl -p for it to take effect. Alternatively, execute the following command in the operating system to apply the change immediately:
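This command block was also lost; presumably the runtime equivalent of the file edit (requires root):

```shell
# Apply at runtime without editing the file (assumed equivalent commands):
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
```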
(2) At the Java application level, add -Djava.net.preferIPv4Stack=true to the JVM startup parameters. If you disable IPv6 at the operating-system level and other applications are deployed on the same server, watch out for side effects; a search engine will help if problems appear. We have many servers, all maintained by the operations team, so I took the second approach: adding the parameter directly to the JVM is simple and convenient.

Final summary: two key WAITING thread states appear in the Java stack log. First comes WAITING (on object monitor): the IPv6 problem blocks all the threads in the HttpClient pool. Then comes WAITING (parking): Tomcat worker threads receive the forwarded requests, the calls into HttpClient cannot obtain an HTTP connection, and because no borrow timeout was configured, a large number of threads block. After both fixes went live, the service has run for a long time under traffic higher than when the incident occurred; the JVM thread blocking has not recurred, and the network statistics no longer show large numbers of CLOSE_WAIT connections.