Understanding TCP three-handshake, four-handshake and Java stack analysis to source code from an online failure

Understanding TCP three-handshake, four-handshake and Java stack analysis to source code from an online failure

Introduction:

  • Introduction to production failure scenarios
  • TCP three-way handshake process
  • TCP disconnection four wave process
  • Analyze source code with Java stack
  • Find the "culprit" from the stack
  • Summary of Problem Optimization Solutions

[[283192]]

1. Introduction to production failure scenarios

Business Introduction:

This service mainly provides external proxy interfaces. Most of the interfaces will call third-party interfaces to obtain data, aggregate it, and then provide it to the client for use.

One evening, when the system was at its peak, the project team members were enjoying their "overtime meal". Just as they put the food in their mouths, they received emails and text messages warning them at the same time.

There was a server interface timeout. I occasionally receive similar alarms, sometimes due to network fluctuations and other reasons. I am really sorry that I always let the "network" classmates take the blame for no reason:). However, this time the alarm did not converge and continued for more than ten minutes, which made me feel bad.

When I clicked the URL interface link in the email, the page kept spinning in circles and the response was very slow. What a tragedy!

At this moment, I silently pushed the lunch box aside, unable to bear to look at it :(

Basic process of problem location:

(1) Determine the scope of impact

There are multiple servers behind this service, and only one server has crashed, so it will not have much impact on users. Temporarily remove it from the registration center, and do not let the client continue to retry to this server, so as to preserve the accident scene.

(2) Check monitoring indicators

Check the access volume of the interface service. Because it is the evening peak, the user access volume will be greater than that of other time periods, but this access volume does not seem to have a particularly obvious sudden increase compared with the same period of time. The CPU, memory, IO, and network indicators of the server observed on the monitoring also seem to be normal.

(3) Server troubleshooting

Log in to the server and check the server CPU, memory and other indicators in combination with monitoring. Check the service logs and they are all normal. There is no special abnormal log output, Exception or OOM and other abnormalities.

What we saw was that the interface service could no longer respond normally. The application ran on the JVM, and we quickly checked it using the common commands that come with the JDK.

The following command prints the stack information:

  1. jstack -l $pid > jstack.log

The statistical results are as follows:

If you encounter thread states such as java.lang.Thread.State: WAITING (parking) and java.lang.Thread.State: WAITING (on object monitor), you should pay attention, as they are usually caused by problems in the program itself.

According to java.lang.Thread.State: WAITING, we checked the stack information in jstack.log and found a large number of logs of calling HttpClient tool class requests waiting for suspension. The specific stack information will be analyzed in detail below.

These service calls are directly called through the HttpClient tool, which encapsulates Spring RestTemplate. The underlying layer also calls the Apache HttpClient tool class to implement service calls.

In addition to seeing the above jstack log exception, the network status on the server was also checked, which is also a common troubleshooting method used by operation and maintenance students.

Attach the command to count the network connection status:

  1. netstat -n | awk '/^tcp/ {++State[$NF]} END {for(i in State) print i, State[i]}'

Statistical results:

  1. TIME_WAIT 9
  2. CLOSE_WAIT 3826
  3. SYN_SENT 2
  4. ESTABLISHED 28
  5. SYN_RECV 8

Note here that we see a large number of connections in the CLOSE_WAIT state in the server's strange network connection statistics.

Moreover, this state will still exist when you execute the statistical command again after a period of time, that is, it will not be released. It seems that the problem is quite serious.

Further speculation is that the appearance of these CLOSE_WAIT states is probably related to the slow response of the interface. At the same time, it is also related to the HttpClient thread blocking that appears in the Java stack information, which is used as a breakthrough point for analysis.

Let's first understand the CLOSE_WAIT state. The CLOSE_WAIT state is in the process of TCP network disconnection. When the client initiates a disconnection request, the server receives the disconnection request for the first time and replies with a confirmation message. After that, it is in the CLOSE_WAIT state. When the server completes the response processing, it will reply to the client with a network packet, and the normal connection will be closed.

2. TCP three-way handshake process

Although the CLOSE_WAIT state is in the four-way handshake process of the TCP network connection, it is still necessary to first understand the three-way handshake of the TCP network connection, because it is the first thing the requesting server needs to do, which is to establish a TCP connection.

Technology comes from life.

We can use an example from daily life to understand the TCP three-way handshake process.

For example, you want to chat with a friend you haven't seen for a long time on WeChat:

Xiaodong: Xiaosheng, are you there? (After a long time... )

Xiaosheng: Yes, are you still here? (Xiaodong is online right now, checking his Moments every day...)

Xiaodong: Yes, yes, I am here (then the two of them started chatting enthusiastically...)

If you always start a conversation with friends like this, wouldn’t you feel tired?

In fact, the above process can well understand the TCP three-way handshake process.

Let's consider Xiaodong as the "client" and Xiaosheng as the "server". Xiaodong is a programmer and works in IT. Xiaosheng is starting a business in his hometown.

Understand the TCP three-way handshake process:

  • Xiaodong, as the "client", initiates a chat with Xiaosheng, as the "server", by sending a network packet (with the signature syn) to Xiaosheng. [This is the first TCP handshake, and Xiaodong is in the syn_sent state at this time]
  • Xiaosheng received Xiaodong's chat network packet. You have to confirm it to indicate that you have received it. At this time, Xiaosheng had other things to do. Unlike Xiaodong, who was doing IT work and was on WeChat at work. In the evening, Xiaosheng had time to check WeChat on his phone and a message from Xiaodong popped up. Then, he excitedly replied, "In response to the sync packet sent by Xiaodong, I made an ack reply to confirm." Because it has been a while, Xiaosheng is not sure whether Xiaodong is still online. [This is the second TCP handshake. Xiaosheng is in the syn_rcvd state at this time]
  • Xiaodong was online at the time and received Xiaosheng's reply confirmation message. He immediately responded to the message, "I responded to Xiaosheng's sync + ack with a further ack reply confirmation. This is the third TCP handshake." [Xiaosheng's status has now changed to "established" and he can chat immediately]
  • At this time, Xiaosheng was also online, and the two began to chat enthusiastically. [The data is officially transmitted, and the status of Xiaodong and Xiaosheng are both in the established state]

The states mentioned above, syn_sent syn_rcvd established, are the key states involved in the TCP three-way handshake process.

Let’s take a look at the previous picture to understand it intuitively:

3. TCP disconnection four wave process

Xiaodong and Xiaosheng’s heated chat ended. It was already late and they had been busy all day, so they needed to rest.

Xiaodong has to get up early tomorrow for work, so he told Xiaosheng in advance:

Xiaodong: I have to get up at 4am tomorrow to upgrade the system. I have to go to bed early. I'll buy you a drink another day!

Xiaosheng: Uh??? Well, I don’t understand anyway!

Xiaosheng: Then you should go to bed early. You still have to drink the wine you mentioned!

Xiaodong: Yeah, good night! You should go to bed early too.

Xiaosheng: OK, good night, brother!

Corresponding understanding of the TCP four-wave process:

  • Xiaodong wants to take a break, so he sends a fin1 packet to end the chat. [Xiaodong is in the fin_wait1 state at this time, which is the first TCP handshake]
  • Xiaosheng received Xiaodong's fin1 packet and replied with an ack confirmation message. [At this time, Xiaosheng is in the close_wait state, and Xiaodong is in the fin_wait2 state. This is the second TCP wave.]
  • Xiaosheng sent a final confirmation, indicating that he did not intend to continue chatting, and sent a fin2 packet. [At this time, Xiaosheng is in the last_ack state, which is the third wave of TCP]
  • Xiaodong finally responded with an ack confirmation to the fin2 packet sent by Xiaosheng. [At this time, Xiaodong is in the time_wait state, which is the fourth TCP handshake]

Why does Xiaodong have to wait in the time_wait state?

Because they followed the "old rules", they had to make sure Xiaosheng actually received the final message before they could finally end the chat.

The standard waiting time for the time_wait state is 4 minutes. During this period, the TCP network connection socket resources (ports) established by Xiaodong and Xiaosheng cannot be used by others, nor can they be recycled and reused by the system.

If Xiaosheng does not receive any feedback, he will continue to ask "re-send fin2 message" until Xiaodong successfully sends the ack message. Both parties officially close the chat channel, release the port resources, and close the connection.

The 4 minutes of waiting is 2 MSLs, and each MSL is 2 minutes.

MSL stands for maximium segment lifetime, which is the maximum message lifetime. This time is specified by the official RFC protocol.

The previous picture provides a further intuitive understanding:

4. Analyze source code with Java stack

After analyzing the TCP four-wave process, when the server receives the TCP disconnect request packet, it needs to reply a confirmation message to the client. At this time, the server status is in CLOSE_WAIT state.

We know the location of this state in the network connection. Combined with the problems mentioned above, a large number of threads are blocked on the HttpClient call. The thread state is WAITING. Statistics on the server show that a large number of network connections in the CLOSE_WAIT state cannot be released.

Threads are a relatively valuable resource in the JVM process. If a large number of threads are always in a waiting or blocked state, all threads will gradually be occupied, causing the service to be unable to respond normally.

Let's analyze the specific reasons through the Java thread stack information and the source code.

Find the first key stack log:

  1. "http-nio-8970-exec-1108" #24971 daemon prio = 5   os_prio = 0   tid = 0x00007f45b4445800   nid = 0x61ad waiting on condition [0x00007f444ad69000]
  2. java.lang.Thread.State: WAITING (parking)
  3. at sun.misc.Unsafe.park(Native Method)
  4. - parking to wait for < 0x00000006c2f30968 > (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  5. at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  6. at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
  7. at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)
  8. at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
  9. at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246)
  10. - locked < 0x0000000641c7fe38 > (a org.apache.http.pool.AbstractConnPool$2)
  11. at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
  12. at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:303)
  13. at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279)
  14. at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
  15. at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
  16. at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
  17. at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
  18. at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  19. at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
  20. at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
  21. at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:89)
  22. at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
  23. at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
  24. at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:660)
  25. at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:629)
  26. at org.springframework.web.client.RestTemplate.getForEntity(RestTemplate.java:329)
  27. at com.xxx.proxy.common.util.HttpClientUtil.getForEntity(HttpClientUtil.java:267)
  28. at com.xxx.proxy.common.util.HttpClientUtil.getForObject(HttpClientUtil.java:521)
  29. ... ...

A large number of the above logs appear in the stack logs, which are basically called by the HttpClient tool class. All threads are in the java.lang.Thread.State: WAITING (parking) state.

The WAITING (parking) thread hangs because the interface service calls a large number of third-party interfaces to obtain HTTP connections, but it cannot obtain them and can only wait.

The HttpClientUtil tool class inherits from Spring RestTemplate and has some parameters, retry mechanisms, and proxy customization. Its package path is located in org.springframework.web.client.RestTemplate.

The class diagram is shown below:

Create HttpClient tool sample code:

  1. HttpClientFactoryBean httpClientFactoryBean = new HttpClientFactoryBean(config);
  2. HttpComponentsClientHttpRequestFactory httpRequestFactory = new HttpComponentsClientHttpRequestFactory(httpClientFactoryBean.getObject());
  3. return new HttpClientUtil(httpRequestFactory);

HttpClientFactoryBean inherits from AbstractFactoryBean and overrides the getObjectType() and createInstance() methods.

The class diagram is shown below:

Some example methods of HttpClientFactoryBean:

  1. @Override
  2. public Class <? > getObjectType() {
  3. return HttpClient.class;
  4. }
  5.  
  6. @Override
  7. protected HttpClient createInstance() {
  8. if ( restConfig == null) {
  9. HttpClients.custom().build();
  10. }
  11. // Maximum number of connections per route
  12. int maxPerRoute = restConfig .getMaxConnections();
  13. // The maximum number of total connections
  14. int maxTotal = restConfig .getMaxTotalConnections();
  15. // Connection timeout
  16. int connectTimeout = restConfig .getConnectionTimeout();
  17. //Timeout for reading data
  18. int socketTimeout = restConfig .getTimeout();
  19.  
  20. PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(30, TimeUnit.SECONDS);
  21. connManager.setDefaultMaxPerRoute(maxPerRoute);
  22. connManager.setMaxTotal(maxTotal);
  23. connManager.setValidateAfterInactivity(1000);
  24.  
  25. RequestConfig requestConfig = RequestConfig .custom().setConnectTimeout(connectTimeout) .setSocketTimeout(socketTimeout).build();
  26.  
  27. / ... Omit some code
  28. return HttpClients.custom().setConnectionManager(connManager).evictExpiredConnections().setDefaultRequestConfig(requestConfig).build();
  29. }

According to the stack information, we can also see that the connection is obtained from the leaseConnection() method of PoolingHttpClientConnectionManager. Then we can look at the source code in detail to see why it is not successfully obtained.

How to find the source code? Through the call stack link in the stack information, you can easily find which classes and methods are passed and which line of code.

According to the log in jstack above:

  1. org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)

As the name suggests, the AbstractConnPool abstract connection pool class is used to call the getPoolEntryBlocking blocking method to obtain the connection, line 380 of the code.

View source code:

  1. private E getPoolEntryBlocking(
  2. final T route, final Object state,
  3. final long timeout, final TimeUnit tunit,
  4. final Future < E > future) throws IOException, InterruptedException, TimeoutException {
  5.  
  6. Date deadline = null ;
  7. //Connection acquisition timeout parameters
  8. if (timeout > 0) {
  9. deadline = new Date (System.currentTimeMillis() + tunit.toMillis(timeout));
  10. }
  11. this.lock.lock();
  12. try {
  13. final RouteSpecificPool < T , C, E >   pool = getPool (route);
  14. // .... Omit some source code
  15.  
  16. boolean success = false ;
  17. try {
  18. if (future.isCancelled()) {
  19. throw new InterruptedException("Operation interrupted");
  20. }
  21. // Put the future, whose actual type is Future < CPoolEntry > , into the pending doubly linked list queue
  22. pool.queue(future);
  23. this.pending.add(future);
  24. if (deadline != null) {
  25. success = this .condition.awaitUntil(deadline);
  26. } else {
  27. // This is the source code of line 380
  28. this.condition.await();
  29. success = true ;
  30. }
  31. if (future.isCancelled()) {
  32. throw new InterruptedException("Operation interrupted");
  33. }
  34. finally
  35. // In case of 'success', we were woken up by the
  36. // connection pool and should now have a connection
  37. // waiting for us, or else we're shutting down.
  38. // Just continue in the loop, both cases are checked.
  39. pool.unqueue(future);
  40. this.pending.remove(future);
  41. }
  42. // check for spurious wakeup vs. timeout
  43. if (!success && (deadline != null && deadline.getTime() < = System.currentTimeMillis())) {
  44. break;
  45. }
  46. }
  47. throw new TimeoutException("Timeout waiting for connection");
  48. finally
  49. this.lock.unlock();
  50. }
  51. }

Find the source code on line 380, where the await() method of condition is called:

  1. this.condition.await();

Here, Condition under the concurrent package is used to implement a multi-threaded coordination communication mechanism. After the await() method is called, the current thread will be added to the Condition waiting queue, which is a FIFO structure queue. At the same time, the current thread lock is released. If it is not released, other threads will not be able to obtain the lock, which may cause a deadlock.

await() method source code:

  1. public final void await() throws InterruptedException {
  2. if (Thread.interrupted())
  3. throw new InterruptedException();
  4. // Join the Condition waiting queue
  5. Node node = addConditionWaiter ();
  6. // Release the lock of the current thread
  7. long savedState = fullyRelease (node);
  8. int interruptMode = 0 ;
  9. // Not in the AQS synchronization queue, suspend the current thread, if in the AQS queue, exit the loop
  10. while (!isOnSyncQueue(node)) {
  11. LockSupport.park(this);
  12. if (( interruptMode = checkInterruptWhileWaiting (node)) != 0)
  13. break;
  14. }
  15. // Already awakened by the signal() method, spin-wait to try to acquire the lock again
  16. if (acquireQueued(node, savedState) && interruptMode != THROW_IE)
  17. interruptMode = REINTERRUPT ;
  18. if (node.nextWaiter != null) // clean up if canceled
  19. unlinkCancelledWaiters();
  20. if (interruptMode != 0)
  21. reportInterruptAfterWait(interruptMode);
  22. }

The current thread joins the Condition waiting queue structure diagram:

When the signalAll() or signal() method is called through Condition, the first node of the waiting queue is obtained, moved to the synchronization queue, and the thread in the node is awakened using LockSupport. The node is moved from the waiting queue to the AQS synchronization queue as shown in the following structure diagram:

In the AbstractConnPool class, we found the release() code for releasing the connection.

The release() method source code is as follows:

  1. @Override
  2. public void release(final E entry, final boolean reusable) {
  3. this.lock.lock();
  4. try {
  5. if (this.leased.remove(entry)) {
  6. final RouteSpecificPool < T , C, E >   pool = getPool (entry.getRoute());
  7. pool.free(entry, reusable);
  8. if (reusable && !this.isShutDown) {
  9. this.available.addFirst(entry);
  10. } else {
  11. entry.close();
  12. }
  13. onRelease(entry);
  14. Future < E >   future = pool .nextPending();
  15. if (future != null) {
  16. this.pending.remove(future);
  17. } else {
  18. future = this .pending.poll();
  19. }
  20. if (future != null) {
  21. this.condition.signalAll();
  22. }
  23. }
  24. finally
  25. this.lock.unlock();
  26. }
  27. }

We can see that when releasing the connection, calling this.condition.signalAll(); the call of the signalAll() method will wake up all the waiting queue threads. Although all are woken up, only one thread can get the lock, and other threads still need to spin and wait.

The signalAll() method source code is as follows:

  1. private void doSignalAll(Node first) {
  2. lastWaiter = firstWaiter = null;
  3. do {
  4. Node next = first .nextWaiter;
  5. first.nextWaiter = null ;
  6. //Signal notification
  7. transferForSignal(first);
  8. first = next ;
  9. } while (first != null);
  10. }
  11.  
  12. final boolean transferForSignal(Node node) {
  13. /*
  14. * Set the node's waitStatus: Condition- > 0
  15. */
  16. if (!compareAndSetWaitStatus(node, Node.CONDITION, 0))
  17. return false;
  18.  
  19. // Join the AQS waiting queue, let the node continue to acquire the lock, and set the previous node status to SIGNAL
  20. Node p = enq (node);
  21. int c = p.waitStatus ;
  22. if (c > 0 || !compareAndSetWaitStatus(p, c, Node.SIGNAL))
  23. // Call LockSupport's unpark() method to wake up the thread
  24. LockSupport.unpark(node.thread);
  25. return true;
  26. }

After analyzing the underlying code, we look back and see that it is because the await() method without parameters of condition is called, and the Http connection cannot be obtained, and the tomcat thread is occupied.

At the beginning of the getPoolEntryBlocking() method, there is a section of code that cannot be ignored:

  1. Date deadline = null ;
  2. //Connection acquisition timeout parameters
  3. if (timeout > 0) {
  4. deadline = new Date (System.currentTimeMillis() + tunit.toMillis(timeout));
  5. }

This code is a timeout at first glance. I guess that the code here should be the waiting time when getting a connection from the connection pool.

See below the getPoolEntryBocking() method:

  1. if (deadline != null) {
  2. success = this .condition.awaitUntil(deadline);
  3. }

If deadline is not empty, the awaitUtil(deadline) method of condition will be called. The awaitUtil(deadline) method indicates that if the deadline is reached and the thread is not awakened, it will automatically wake up and join the AQS synchronization queue to acquire the lock.

We can continue to look for the caller based on the stack information and see the source of the timeout in the deadline.

  1. at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)

MainClientExec#execute() method source code:

  1. final HttpClientConnection managedConn;
  2. try {
  3. final int timeout = config .getConnectionRequestTimeout();
  4. managedConn = connRequest .get(timeout > 0 ? timeout : 0, TimeUnit.MILLISECONDS);
  5. } catch(final InterruptedException interrupted) {
  6. }

The timeout here, namely connectionRequestTimeout, is the timeout value for calculating the deadline, which confirms our guess.

The initial configuration parameters of the HttpClient tool are initialized, and the connectionRequestTimeout parameter is not configured. This parameter is also very critical. If it is not set, and the thread suspended by park has not been awakened by the signal, it will continue to wait.

Therefore, this parameter must be set. The deadline here is an absolute time. When it is not empty, the awaitUtil(deadline) method of the condition will be called. Even if it is not awakened by the signal, it will automatically wake up and compete for the lock, instead of being blocked without being awakened.

Moreover, the design of the deadline variable in the awaitUtil(deadline) method is similar to that in the awaitNanos(long nanosTimeout) method.

When the set timeout period is reached and no signal is given, the success variable is false, indicating failure. The loop is broken directly, and a TimeoutException ("Timeout waiting for connection") exception is thrown.

When this exception is thrown, the system error log will clearly indicate that the connection cannot be obtained. At the same time, it also avoids the thread being occupied all the time.

5. Find the "culprit" from the stack

In the previous section, we analyzed the details of the underlying source code of Condition concurrency from the first stack log. But this is not the end, because we only analyzed the WAITING (parking) state in the statistics of java.lang.Thread.State, and the cause of the problem may not be caused by this state. Next, we will continue to analyze another "abnormal" thread state WAITING (on object monitor).

The second key log in the Java stack is as follows:

  1. "http-nio-8970-exec-462" #24297 daemon prio = 5   os_prio = 0   tid = 0x00007f45b41bd000   nid = 0x5f0b in Object.wait() [0x00007f446befa000]
  2. java.lang.Thread.State: WAITING (on object monitor)
  3. at java.lang.Object.wait(Native Method)
  4. at java.lang.Object.wait(Object.java:502)
  5. at java.net.InetAddress.checkLookupTable(InetAddress.java:1393)
  6. - locked < 0x00000006c05a5570 > (a java.util.HashMap)
  7. at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1310)
  8. at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
  9. at java.net.InetAddress.getAllByName(InetAddress.java:1192)
  10. at java.net.InetAddress.getAllByName(InetAddress.java:1126)
  11. at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
  12. at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
  13. at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
  14. at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
  15. at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
  16. at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
  17. at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
  18. at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
  19. at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  20. at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
  21. at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
  22. at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:89)
  23. at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
  24. at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
  25. at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:660)
  26. at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:629)
  27. at org.springframework.web.client.RestTemplate.getForEntity(RestTemplate.java:329)
  28. at com.xxx.tvproxy.common.util.HttpClientUtil.getForEntity(HttpClientUtil.java:267)
  29. at com.xxx.tvproxy.common.util.HttpClientUtil.getForObject(HttpClientUtil.java:521)
  30. ``

java.lang.Thread.State: WAITING (on object monitor). This thread state should also attract special attention. The object lock is monitored and the thread is blocked without being released.

According to the thread stack information, it is speculated that it is related to the HttpClient parameter settings. Let's analyze the creation parameters. Looking at the top of the stack, we can see that the wait() method of the Object object is called, indicating that it is waiting for another thread to notify, but it has not waited for a long time, and the current thread has been in the WAITING state.

Continue to find out who called:

  1. at java.net.InetAddress.checkLookupTable(InetAddress.java:1393) ``

This code call causes you to look at the source code:

  1. private static InetAddress[] checkLookupTable(String host) {
  2. synchronized (lookupTable) {
  3. // If the host isn't in the lookupTable, add it in the
  4. // lookuptable and return null. The caller should do
  5. // the lookup.
  6. if (lookupTable.containsKey(host) == false) {
  7. lookupTable.put(host, null);
  8. return null;
  9. }
  10.  
  11. // If the host is in the lookupTable, it means that another
  12. // thread is trying to look up the addresses of this host.
  13. // This thread should wait.
  14. while (lookupTable.containsKey(host)) {
  15. try {
  16. // Corresponding to the stack java.net.InetAddress.checkLookupTable(InetAddress.java:1393)
  17. lookupTable.wait();
  18. } catch (InterruptedException e) {
  19. }
  20. }
  21. }
  22.  
  23. // The other thread has finished looking up the addresses of
  24. // the host. This thread should retry to get the addresses
  25. // from the addressCache. If it doesn't get the addresses from
  26. // the cache, it will try to look up the addresses itself.
  27. InetAddress[] addresses = getCachedAddresses (host);
  28. if ( addresses == null ) {
  29. synchronized (lookupTable) {
  30. lookupTable.put(host, null);
  31. return null;
  32. }
  33. }
  34.  
  35. return addresses;
  36. }

The lookupTable object was found, and the synchronized block lock was used. The wait() method of the lookupTable object was called internally, but no notification was received here and the method was blocked.

After checking the code for this problem, you cannot find any problem because it has little to do with the application itself. It is because of the JVM thread deadlock problem caused by IPV6.

Refer to the foreign Zimbra site wiki: https://wiki.zimbra.com/wiki/Configuring_for_IPv4

Here is the explanation of the cause of the problem:

The application itself is in an IPv4 environment, and if you try to use IPv6, it will cause some known problems.

When the Inet6AddressImpl.lookupAllHostAddr() method is called, there is a bug between Java and the operating system libc library. When a specific race condition occurs, the host address search action will continue in an infinite loop. This situation occurs very rarely, but once it occurs, it will cause the JVM to deadlock, which will cause all threads in the JVM to be blocked.

According to the above analysis, the third key stack log is found in the jstack stack as follows:

  1. java.lang.Thread.State: RUNNABLE
  2. at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
  3. at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
  4. at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
  5. at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
  6. at java.net.InetAddress.getAllByName(InetAddress.java:1192)
  7. at java.net.InetAddress.getAllByName(InetAddress.java:1126)
  8. at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
  9. at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
  10. at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
  11. at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
  12. at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
  13. at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
  14. at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
  15. at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
  16. at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  17. at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
  18. at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
  19. at
  20. ... ...

How to determine whether IPv6 is enabled in the operating system?

Two methods are introduced:

(1)ifconfig

This is obvious. The words inet6 addr indicate that IPv6 is enabled.

(2)lsmod

  1. [root@BJ]# lsmod | grep ipv6
  2. Module Size Used by
  3. ipv6 335951 73 bridge

Mainly look at the Used column, the value is 70+, and the Used column is 1 in an environment that does not support IPv6 (this value may be different in different server environments).

6. Summary of Problem Optimization Solutions

After analyzing the key thread status in the Java stack, the cause of the problem was identified. Next, the solution to the problem is described.

First question:

When a connection cannot be obtained from the Http connection pool, the thread may enter a blocked state.

Add connectionRequestTimeout in the HttpClient client initialization parameter configuration to obtain the connection timeout. It is generally not recommended to set it too large. We set it to 500ms.

After setting, the underlying condition#awaitUtil(deadline) method will be called. When the thread cannot be awakened by the signal and the deadline is reached, the thread will be automatically awakened from the waiting queue and added to the AQS synchronization queue to compete for the lock.

Second question:

There are two solutions to the JVM process deadlock problem caused by IPv6:

(1) Disable IPv6 at the operating system level

Edit the /etc/sysctl.conf file and add the following two lines:

  1. net.ipv6.conf.all.disable_ipv6 = 1  
  2. net.ipv6.conf.default.disable_ipv6 = 1  

Save the file and execute sysctl -p to make it take effect.

Execute the following command in the operating system to take effect directly:

  1. sysctl -w net.ipv6.conf.all.disable_ipv6 = 1  
  2. sysctl -w net.ipv6.conf.default.disable_ipv6 = 1  

(2) Java application level

Add -Djava.net.preferIPv4Stack=true to the application JVM startup parameters.

Disable IPv6 from the operating system level. If other applications are deployed on the server, pay attention to them. If you encounter any problems, you can use search engines to check.

We have many servers, all of which are maintained by operations and maintenance, so I adopted the second method, which is to add parameters directly on the JVM, which is simple and convenient.

Final summary:

There are two key WAITING thread states in the java stack log. First, WAITING (on object monitor) appears, because the IPv6 problem triggers the blocking of all threads in the HttpClient thread pool. Then WAITING (parking) appears. The Tomcat thread receives the forwarded request. When the request is called to HttpClient, it cannot obtain the Http connection resource and the timeout for obtaining the connection is not set, causing a large number of threads to be blocked.

After optimizing the above two issues, we have been online for a long time and have experienced higher traffic than when this issue occurred. The JVM thread blocking issue has not occurred. Statistics from the network command line show that there are basically no large numbers of CLOSE_WAIT network connection states.

<<:  Can't tell the difference between Wi-Fi and WLAN? Stop confusing them

>>:  5G is coming, these 10 common things will disappear in the future

Recommend

On "GPL is a trap for software developers"

[[381740]] This article is reprinted from the WeC...

CloudCone: $15/year KVM-512MB/15GB/5TB/Los Angeles data center

CloudCone is a foreign VPS hosting company founde...

...

5G Downlink Channel Sounding "CSI-RS"

Since 4G, the bandwidth of carriers has increased...

5G indoor coverage poses problems for office buildings and operators

It’s no secret that 5G cellular is coming, but mo...

RackNerd: San Jose data center VPS promotion starting at $11.88/year

A week ago, we shared RackNerd's promotional ...

Message bus for communication between processes

[[381755]] 1. Inter-process communication (IPC) in...

Summarize various issues between wireless AP and AC of WiFi system

In wireless network construction, it is said that...