Talking about TCP long connections and heartbeats

1. Introduction

For many Java programmers, knowledge of TCP may stop at the three-way handshake that opens a connection and the four-way exchange that closes it. I think there are two main reasons. First, TCP itself is somewhat abstract (compared with HTTP at the application layer). Second, developers who do not work on frameworks rarely need to deal with the details of TCP. In fact, I do not fully understand many of those details myself. This article summarizes the long connection and heartbeat questions that people have raised in my WeChat discussion group.

In Java, TCP communication usually involves Socket or Netty. This article uses some of their APIs and configuration parameters to support the discussion.

2. Long connections and short connections

TCP itself does not distinguish between long and short connections. Whether it is long or short depends entirely on how we use it.

  • Short connection: a socket is created for each communication and socket.close() is called as soon as that communication ends. This is a short connection in the general sense. Its advantage is that it is easy to manage: every connection that exists is a usable connection, so no extra control measures are needed.
  • Long connection: the connection is not closed after a communication completes, so it can be reused. Its advantage is that it saves the cost of establishing a connection each time.

The advantage of each kind of connection is the disadvantage of the other. If you want to keep things simple and do not need high performance, short connections are appropriate, because you do not have to manage connection state. If you want performance and choose long connections, you have to deal with several issues, such as maintaining the end-to-end connections and keeping them alive.
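The difference can be seen with plain java.net.Socket. The following is only a minimal sketch: the class ConnectionStyles, its method names, and the line-based echo protocol are assumptions made for illustration and are not taken from Dubbo or Netty. exchange() sends all messages over one socket (long connection style), while shortConnections() opens and closes a socket per message.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are not taken from Dubbo or Netty.
class ConnectionStyles {

    // Starts a line-based echo server on a free port chosen by the OS (port 0).
    // Each accepted connection is served until the client closes it.
    static ServerSocket startEchoServer() {
        try {
            ServerSocket server = new ServerSocket(0);
            Thread acceptor = new Thread(() -> {
                while (true) {
                    try {
                        Socket socket = server.accept();
                        Thread worker = new Thread(() -> echo(socket));
                        worker.setDaemon(true);
                        worker.start();
                    } catch (IOException e) {
                        return; // the server socket was closed
                    }
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Echoes every received line back to the client.
    private static void echo(Socket socket) {
        try (Socket s = socket;
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);
            }
        } catch (IOException ignored) {
            // the client went away; try-with-resources has already closed the socket
        }
    }

    // Long connection style: all messages travel over ONE socket.
    static List<String> exchange(int port, String... messages) {
        try (Socket socket = new Socket("127.0.0.1", port);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            List<String> replies = new ArrayList<>();
            for (String message : messages) {
                out.println(message);
                replies.add(in.readLine());
            }
            return replies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Short connection style: a new socket is created and closed for every message.
    static List<String> shortConnections(int port, String... messages) {
        List<String> replies = new ArrayList<>();
        for (String message : messages) {
            replies.addAll(exchange(port, message));
        }
        return replies;
    }
}
```

Both methods return the same replies. The short connection version pays for a new TCP handshake on every message, while the long connection version has to decide when the socket is still usable, which is the subject of the rest of this article.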

Long connections are also often used to push data. Most of the time we think of communication as a request/response model, but because TCP is full-duplex it can also be used for two-way communication. With a long connection, a push model is easy to implement.

There is not much to say about short connections, so the rest of this article focuses on long connections. Pure theory is a bit dry, so I will use some practices from the RPC framework Dubbo to discuss TCP.

3. Long Connections in the Service Governance Framework

As mentioned above, a framework that pursues performance will inevitably choose long connections, so Dubbo is a good way to learn about TCP. We start two Dubbo applications: a server that listens on local port 20880 (the default port of the Dubbo protocol) and a client that sends requests in a loop. Run the command lsof -i:20880 to see how the port is being used:

*:20880 (LISTEN) indicates that Dubbo is listening on local port 20880 and handling the requests sent to it.

The last two entries describe the connection that carries the requests and confirm that TCP communication is two-way. Because both Dubbo applications run on the same machine, you can see local port 53078 communicating with port 20880. We did not set the client port 53078 ourselves; it was assigned randomly. This illustrates a simple fact: even the side that sends requests occupies a port.

A brief word about the FD column: it is the file descriptor (file handle). Every new connection occupies a new file descriptor. If you get a "Too many open files" exception when using TCP, check whether you have created too many connections without closing them. Careful readers will notice another benefit of long connections here: they occupy fewer file descriptors.
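The same observation can be reproduced without Dubbo. The sketch below (EphemeralPortDemo is a hypothetical class written for this article) opens a loopback connection with java.net.Socket and reports the port the operating system assigned to the client side.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class showing that the connecting side also occupies a port.
class EphemeralPortDemo {

    // Opens a loopback connection and returns
    // {server listening port, client local port, client port as seen by the server}.
    static int[] ports() {
        try (ServerSocket server = new ServerSocket(0); // port 0: let the OS choose
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            return new int[] {server.getLocalPort(), client.getLocalPort(), accepted.getPort()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] p = ports();
        System.out.println("server=" + p[0] + " client=" + p[1] + " seenByServer=" + p[2]);
    }
}
```

The client never chooses its port, yet the server sees the connection arriving from exactly the port the client reports as its local port.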

4. Maintenance of long connections

Because the services a client calls may be spread across several servers, the client naturally needs a long connection to each of them. The first problem we meet when using long connections is therefore how to keep track of them.

    // Client (excerpt)
    public class NettyHandler extends SimpleChannelHandler {
        // key: "ip:port" of the peer, value: the connection to that peer
        private final Map<String, Channel> channels = new ConcurrentHashMap<String, Channel>(); // <ip:port, channel>
    }

    // Server (excerpt)
    public class NettyServer extends AbstractServer implements Server {
        private Map<String, Channel> channels; // <ip:port, channel>
    }

In Dubbo, both the client and the server index their end-to-end long connections by ip:port, and Channel is the abstraction of a connection. We mainly focus on the long connections held in NettyHandler. The server also keeps a collection of long connections; this is a design choice of Dubbo that we will come back to later.
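The idea of indexing connections by ip:port can be sketched independently of Dubbo. In the following example, ConnectionRegistry, its generic connection type C, and the connector function are all hypothetical names; they only illustrate the pattern and are not Dubbo's implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical registry of long connections keyed by "ip:port".
// C stands in for whatever connection abstraction is used (for example a Channel).
class ConnectionRegistry<C> {

    private final Map<String, C> channels = new ConcurrentHashMap<>();
    private final Function<String, C> connector; // opens a connection for an "ip:port" key

    ConnectionRegistry(Function<String, C> connector) {
        this.connector = connector;
    }

    // Returns the existing connection to the peer, or opens one if none exists yet.
    // computeIfAbsent ensures only one connection is created per peer under concurrency.
    C getOrConnect(String host, int port) {
        return channels.computeIfAbsent(host + ":" + port, connector);
    }

    // Forgets the connection, for example after it has been found to be dead.
    C remove(String host, int port) {
        return channels.remove(host + ":" + port);
    }

    int size() {
        return channels.size();
    }
}
```

A registry like this only records which connections exist. It says nothing about whether they are still usable, which is why a keepalive mechanism is needed on top of it.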

5. Connection Keepalive

This topic is worth discussing and touches on many points. First, why do connections need to be kept alive at all? When the two parties have established a connection but the link between them is broken by a network problem, the long connection can no longer be used. Note that seeing a connection in the ESTABLISHED state with commands such as netstat or lsof is not a reliable check, because the connection may already be dead without the operating system having noticed, not to mention the harder problem of a peer that is "falsely dead" (hung but still connected). Ensuring that long connections are actually usable takes real engineering effort.

6. Connection keepalive: KeepAlive

The first thing that comes to mind is the KeepAlive mechanism in TCP. KeepAlive is not part of the core TCP specification (RFC 1122 describes it as an optional feature that is off by default), but most operating systems implement it. Once KeepAlive is enabled, if no data has been transmitted on the connection for a certain period (7200 s by default on Linux, parameter tcp_keepalive_time), the TCP layer sends a KeepAlive probe to check whether the connection is still usable. If a probe goes unanswered, it is repeated every 75 s (parameter tcp_keepalive_intvl) up to the number of times set by tcp_keepalive_probes (9 by default on Linux). If all probes fail, the connection is considered unavailable.

Enable KeepAlive in Netty:

    bootstrap.option(ChannelOption.SO_KEEPALIVE, true)
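With a plain java.net.Socket, KeepAlive is switched on through Socket.setKeepAlive(true), which sets the SO_KEEPALIVE socket option. The sketch below uses a hypothetical class name and a loopback connection only so that it can run on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class: enables TCP KeepAlive on a plain socket.
class SocketKeepAliveDemo {

    // Connects to a throwaway loopback server, enables SO_KEEPALIVE,
    // and returns the value the socket reports afterwards.
    static boolean enableKeepAlive() {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            client.setKeepAlive(true);   // only the on/off switch is set here
            return client.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + enableKeepAlive());
    }
}
```

This call is only the switch. The idle time, probe interval, and probe count still come from the operating system settings.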

To set KeepAlive related parameters in the Linux operating system, modify the /etc/sysctl.conf file:

    net.ipv4.tcp_keepalive_time=90
    net.ipv4.tcp_keepalive_intvl=15
    net.ipv4.tcp_keepalive_probes=2

The KeepAlive mechanism ensures the availability of the connection at the network level, but from the point of view of an application framework this is not enough. This shows mainly in two aspects:

  • The KeepAlive switch is turned on by the application, but the specific parameters (such as the number of probes and the probe interval) are set at the operating system level, in /etc/sysctl.conf, which is not flexible enough for applications.
  • The KeepAlive mechanism works only while the link is idle. What happens if data is being sent and the physical link is broken while the operating system still reports the connection as ESTABLISHED? The TCP retransmission mechanism takes over, and with the default retransmission timeout and exponential backoff, giving up also takes a very long time.

KeepAlive itself is network-oriented, not application-oriented. When the connection is unavailable, it may be due to GC problems in the application itself, high system load, etc., but the network is still connected. At this time, the application has lost its activity, so the connection should naturally be considered unavailable.

It seems that keeping the connection alive at the application level is still necessary.

7. Connection keepalive: application layer heartbeat

We have finally reached the main topic. The heartbeat in the title is the other TCP-related point that this article wants to emphasize. The previous section explained that KeepAlive at the network level is not enough to guarantee that a connection is usable at the application level. This section looks at the application layer heartbeat mechanism used to keep connections alive.

What is an application layer heartbeat? Simply put, the client starts a scheduled task that sends a request (a special heartbeat request) to each peer it is connected to, and the server handles this request specially and returns a response. If several consecutive heartbeats receive no response, the client considers the connection unavailable and actively closes it. Service governance frameworks differ in their strategies for heartbeats, connection establishment, disconnection, and blacklisting, but most of them perform heartbeats at the application layer, and Dubbo is no exception.
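The mechanism just described can be sketched in a few lines. Everything in the example below is an assumption made for illustration: the class HeartbeatMonitor, the Connection interface, and the "close after maxMissed unanswered heartbeats" policy are hypothetical and do not reproduce Dubbo's HeartBeatTask or any other framework's code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical client-side heartbeat monitor for one long connection.
class HeartbeatMonitor {

    // Minimal view of a connection; a real one would wrap a socket or channel.
    interface Connection {
        void sendHeartbeat();
        void close();
    }

    private final Connection connection;
    private final int maxMissed;                               // unanswered heartbeats tolerated
    private final AtomicInteger missed = new AtomicInteger();  // heartbeats sent since last response
    private volatile boolean closed;

    HeartbeatMonitor(Connection connection, int maxMissed) {
        this.connection = connection;
        this.maxMissed = maxMissed;
    }

    // Runs once per heartbeat period: either gives up on the connection
    // or sends the next heartbeat request.
    void tick() {
        if (closed) {
            return;
        }
        if (missed.get() >= maxMissed) {
            closed = true;
            connection.close(); // the peer is considered unavailable
            return;
        }
        missed.incrementAndGet();
        connection.sendHeartbeat();
    }

    // Called by the I/O layer whenever a heartbeat response arrives.
    void onHeartbeatResponse() {
        missed.set(0);
    }

    boolean isClosed() {
        return closed;
    }

    // Schedules tick() at a fixed rate; the caller shuts the scheduler down when done.
    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::tick, period, period, unit);
        return scheduler;
    }
}
```

Keeping tick() separate from the scheduler makes the policy easy to test without waiting for real time to pass. A production implementation would typically also count ordinary traffic as proof of life and decide whether to reconnect after closing.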

8. Design details of application layer heartbeat

Take Dubbo as an example of application layer heartbeat support. Both the client and the server start a HeartBeatTask: the client starts it in HeaderExchangeClient, and the server starts it in HeaderExchangeServer. When a heartbeat times out, the two sides react differently:

    // HeartBeatTask (excerpt): what happens once the heartbeat has timed out
    if (channel instanceof Client) {
        // client side: try to re-establish the connection
        ((Client) channel).reconnect();
    } else {
        // server side: simply close the connection
        channel.close();
    }

Readers who are familiar with other RPC frameworks will find that heartbeat mechanisms differ a great deal from one framework to another. Heartbeat design is also tied to connection creation, the reconnection mechanism, and connection blacklisting, so each framework has to be analyzed on its own terms.

Besides the scheduled task, heartbeats also need support at the protocol level. The simplest example is the health check of nginx. The Dubbo protocol needs heartbeat support as well: if heartbeat requests were treated as normal traffic, they would put extra load on the server and interfere with rate limiting, among other problems.

Flag is the flag byte of the Dubbo protocol header, with 8 bits in total. The lower four bits indicate the serialization tool used for the message body (Hessian by default). Among the upper four bits, the first bit set to 1 marks a request, the second bit set to 1 marks two-way transmission (that is, a response is expected), and the third bit set to 1 marks a heartbeat event.

Heartbeat requests should be treated differently from normal requests.
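The flag layout described above can be expressed with bit masks. The sketch below follows the description in this article literally (three high bits for request, two-way, and event, and the lower four bits for the serialization id). FlagByteSketch and its constants are hypothetical and are not Dubbo's own codec class, so Dubbo's source remains the authority on the exact masks.

```java
// Hypothetical helper that mirrors the bit layout described in the text;
// it is not Dubbo's own codec class.
class FlagByteSketch {

    static final int FLAG_REQUEST = 0x80;       // highest bit: 1 = request
    static final int FLAG_TWOWAY = 0x40;        // second bit: 1 = a response is expected
    static final int FLAG_EVENT = 0x20;         // third bit: 1 = event such as a heartbeat
    static final int SERIALIZATION_MASK = 0x0F; // lower four bits: serialization id

    // Packs the three markers and the serialization id into one flag byte.
    static int encode(boolean request, boolean twoWay, boolean event, int serializationId) {
        if ((serializationId & ~SERIALIZATION_MASK) != 0) {
            throw new IllegalArgumentException("serialization id must fit in four bits");
        }
        int flag = serializationId;
        if (request) {
            flag |= FLAG_REQUEST;
        }
        if (twoWay) {
            flag |= FLAG_TWOWAY;
        }
        if (event) {
            flag |= FLAG_EVENT;
        }
        return flag;
    }

    static boolean isRequest(int flag) {
        return (flag & FLAG_REQUEST) != 0;
    }

    static boolean isTwoWay(int flag) {
        return (flag & FLAG_TWOWAY) != 0;
    }

    // A set event bit is what lets the server tell a heartbeat from business traffic.
    static boolean isHeartbeat(int flag) {
        return (flag & FLAG_EVENT) != 0;
    }

    static int serializationId(int flag) {
        return flag & SERIALIZATION_MASK;
    }
}
```

Because the event bit is checked before the body is decoded, a server can answer heartbeats on a fast path and keep them out of rate limiting and business metrics.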

9. Pay attention to the difference between HTTP Keep-Alive and TCP KeepAlive

  • The Keep-Alive feature of the HTTP protocol is intended to reuse connections, so that request/response data can be transmitted serially over the same connection.
  • TCP's KeepAlive mechanism is intended to keep a connection alive, act as a heartbeat, and detect connection errors.

These are two completely different concepts.

10. KeepAlive Common Exceptions

Applications that enable TCP KeepAlive can generally catch the following kinds of errors:

ETIMEDOUT: timeout error. If no ACK is received within (tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes) while the keepalive probes are being sent, the error is raised and the socket is closed. In Java: java.io.IOException: Connection timed out

EHOSTUNREACH: host unreachable error. This is reported to the application from ICMP. In Java: java.io.IOException: No route to host

ECONNRESET: the connection was reset. The peer may have crashed or hung and then restarted. When it later receives a packet from the server, it no longer knows anything about the old connection and can only reply with a reset (RST). In Java: java.io.IOException: Connection reset by peer
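As a worked example of the timeout formula in this section, the small helper below (a hypothetical class written for this article) adds the idle time to the probe interval multiplied by the probe count. With the sysctl values shown earlier (90, 15, 2) a dead peer is detected after about 90 + 15 * 2 = 120 seconds.

```java
// Hypothetical helper for the keepalive timeout formula:
// tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
class KeepAliveTiming {

    // Returns the approximate number of seconds until an idle, dead connection
    // is reported as timed out.
    static long detectionSeconds(long keepaliveTime, long keepaliveIntvl, int keepaliveProbes) {
        return keepaliveTime + keepaliveIntvl * keepaliveProbes;
    }

    public static void main(String[] args) {
        // Values from the /etc/sysctl.conf example in this article.
        System.out.println(detectionSeconds(90, 15, 2) + " s");
    }
}
```

The same formula shows why untuned settings are of little practical use: with an idle time of 7200 s the result is always more than two hours, whatever the interval and probe count.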

11. Conclusion

There are three practical scenarios for using KeepAlive:

1. Use TCP KeepAlive on its own. By default the KeepAlive period is 2 hours; leaving it unchanged is a misuse that wastes resources, because the kernel starts a keepalive timer for every connection, so N connections mean N keepalive timers. Even so, the advantages of this approach are clear:

  • The keepalive probing is done at the TCP protocol layer, so the kernel handles it automatically on behalf of the application.
  • Kernel-level timers are more efficient than timers in the application.
  • The application only needs to handle sending and receiving data and notifications of connection errors.
  • The probe packets are more compact.

2. Disable TCP KeepAlive and use an application layer heartbeat to keep connections alive. The application controls the heartbeat itself, which is more flexible and controllable. For example, the heartbeat period can be set at the application level and adapted to a private protocol.

3. Use a business heartbeat together with TCP KeepAlive so that they complement each other. The TCP keepalive probe period and the application heartbeat period must be coordinated; if the gap between them is too large, the combination will not have the expected effect.

Each framework makes its own choice. For example, Dubbo uses option 3, while the HSF framework inside Alibaba does not enable TCP KeepAlive and relies only on the application heartbeat. Like the heartbeat strategy, this depends on the overall design of the framework.
