TCP has already implemented KeepAlive, why does the application layer need to implement it again?

TCP Heartbeat

TCP Keepalive is a mechanism for detecting whether a TCP connection is still alive. It determines the state of the connection by sending probe packets when the connection has been idle for a while. It is mainly used to detect dead (zombie) connections and to keep the mappings on NAT and firewall devices from expiring.

1. Brief description of the principle

  • TCP Keepalive is off by default and must be enabled per socket with the SO_KEEPALIVE option. Each side enables it independently; the peer's TCP stack answers probes whether or not it has set the option itself.
  • If no data is transmitted for a certain period of time (default 2 hours), TCP sends a Keepalive probe packet.
  • If the other party is still alive, it acknowledges the probe. If it does not respond, TCP sends the probe again after a retry interval.
  • If no response has been received after the maximum number of probes (Linux default: 9), TCP considers the connection dead and closes it.

The working principle of TCP Keepalive can be expressed as the following pseudocode (for a single TCP connection). The helper functions are illustrative and are not real APIs.

    # TCP Keepalive control parameters (Linux defaults)
    KEEPALIVE_INTERVAL = 7200  # idle time before the first probe: 2 hours
    KEEPALIVE_PROBES = 9       # maximum number of unanswered probes
    KEEPALIVE_TIMEOUT = 75     # wait between probes: 75 seconds

    # Enable TCP Keepalive and initialize the control parameters
    def enable_keepalive(socket):
        socket.setsockopt(..., socket.SO_KEEPALIVE, 1)
        socket.setsockopt(..., KEEPALIVE_INTERVAL)
        socket.setsockopt(..., KEEPALIVE_PROBES)
        socket.setsockopt(..., KEEPALIVE_TIMEOUT)

    # If the connection has been idle for the Keepalive interval,
    # send a probe packet and start the probe timer
    def send_keepalive_probe(socket):
        if is_idle(socket, KEEPALIVE_INTERVAL):
            send_probe_packet(socket)
            start_probe_timer(socket)

    # Handle the Keepalive response:
    # if the probe was ACKed, reset the idle timer;
    # otherwise count the failed probe and close the connection
    # once the maximum number of probes has gone unanswered
    def handle_keepalive_response(socket):
        if received_probe_ack(socket):
            reset_idle_timer(socket)
        else:
            increment_probe_count(socket)
            if probe_count(socket) >= KEEPALIVE_PROBES:
                close_connection(socket)

    # Check for a Keepalive timeout: when the probe timer expires,
    # evaluate the response
    def check_keepalive_timeout(socket):
        if probe_timer_expired(socket):
            handle_keepalive_response(socket)

    # Main loop:
    # 1. enable the Keepalive option
    # 2. while the connection is open, periodically send probes and check for timeouts
    def manage_keepalive(socket):
        enable_keepalive(socket)
        while is_open(socket):
            send_keepalive_probe(socket)
            check_keepalive_timeout(socket)
            time.sleep(KEEPALIVE_TIMEOUT)

    ...

2. Related parameters

Several parameters related to the TCP Keepalive mechanism in the Linux kernel are as follows:

  • tcp_keepalive_time: Idle time before the first probe (default 7200 seconds, i.e. 2 hours)
  • tcp_keepalive_intvl: Interval between successive probes (default 75 seconds)
  • tcp_keepalive_probes: Maximum number of unanswered probes (default 9)

These parameters can be changed at runtime through the /proc filesystem. Backend servers that handle many concurrent connections or mobile clients usually benefit from values much shorter than the defaults:

    # Set the idle time before the first probe to 10 minutes
    echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time

    # Set the interval between probes to 15 seconds
    echo 15 > /proc/sys/net/ipv4/tcp_keepalive_intvl

    # Set the maximum number of probes to 3
    echo 3 > /proc/sys/net/ipv4/tcp_keepalive_probes

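Values written under /proc/sys take effect immediately but are lost on reboot. To make them persistent, add the corresponding net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, and net.ipv4.tcp_keepalive_probes entries to /etc/sysctl.conf and run sysctl -p to load them.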
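
The system-wide values can also be overridden for a single connection. The following is a minimal Python sketch, assuming a Linux-like platform that exposes the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT socket options; the function name enable_keepalive and its default values are illustrative choices, not recommendations.

```python
import socket

def enable_keepalive(sock, idle=600, interval=15, probes=3):
    """Enable TCP keepalive on one socket and, where supported, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are not available on every platform,
    # so skip any that this Python build does not expose.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Per-socket settings leave other applications on the same host unaffected, which is usually preferable to changing the kernel defaults.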

3. Limitations

The TCP Keepalive mechanism is executed by the kernel (operating system). When a process exits, the kernel closes the connections that the process left open one by one, sending a FIN segment to the other end of each connection. This ensures that both parties of every connection learn its state and can run the appropriate business logic in response.

On the surface, the kernel can handle connection teardown correctly whether the process is running or exiting. However, in some extreme scenarios the kernel cannot guarantee that the TCP protocol stack keeps working normally, for example:

  • The operating system crashes and restarts, so the TCP protocol stack has no chance to send a FIN segment.
  • The server hardware or its infrastructure fails (for example a power outage, a network disconnection, or a natural disaster), so the TCP protocol stack has no chance to send a FIN segment.
  • There are a large number of concurrent connections when the operating system or process restarts, and the TCP protocol stack may not manage to close all of them. In other words, if a FIN segment is lost, there is no time left to retransmit it.
  • A network link fails. Both parties can only confirm this once TCP Keepalive detection times out, by which point a considerable time may have passed since the failure occurred.
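
To get a feel for how long that delay can be, the following Python sketch estimates the worst-case detection time from the three kernel parameters. It assumes the simple model described earlier (one idle period, then a fixed number of unanswered probes spaced one interval apart); the function name is hypothetical.

```python
def worst_case_detection_seconds(idle, interval, probes):
    """Rough time from the last traffic until the kernel declares the peer dead."""
    # One idle period before the first probe, then `probes` unanswered probes,
    # each followed by a wait of `interval` seconds.
    return idle + interval * probes

linux_defaults = worst_case_detection_seconds(7200, 75, 9)
tuned = worst_case_detection_seconds(600, 15, 3)
print(linux_defaults, tuned)  # 7875 645
```

With the Linux defaults a dead peer can go unnoticed for about 2 hours and 11 minutes; with the tuned values shown earlier the window shrinks to under 11 minutes.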

Application layer heartbeat

1. Necessity

The previous section covered the limitations of the TCP Keepalive mechanism, which is implemented in the kernel. There is a further problem from the application's point of view: TCP Keepalive cannot verify what an application-layer heartbeat is meant to verify, namely that the application is still working normally. A successful TCP Keepalive probe can only mean two things:

  • The application (process) still exists
  • The network link is normal

However, when the application process misbehaves at runtime (a deadlock, an infinite loop caused by a bug, an indefinitely blocked call, and so on), the operating system still answers TCP Keepalive probes normally, so the communication partner never learns that the application is in trouble.

In addition, application-layer heartbeat detection is more flexible: the application controls the detection time and interval, decides how failures are handled, and can attach additional data to each heartbeat.

For these reasons, application-layer heartbeat detection still has to be implemented.

2. Implementation

Common application layer heartbeat implementations are:

  • HTTP: Request a specified URL and judge whether the application is healthy from the response code or response data
  • Exec: Execute a specified (shell) command, such as a file check or network check, and inspect its exit status code. A status code of 0 means the application is running normally.
  • WebSocket: Similar to the HTTP detection method
  • Other custom detection methods

The most common detection method in the industry is HTTP over persistent (long-lived) connections, mainly because:

  • HTTP is simple to implement, and its persistent connection approach avoids the overhead of establishing and releasing connections.
  • HTTP has low requirements for (heterogeneous) environments, and most applications use HTTP as the main API communication protocol, so heartbeat detection does not bring much additional workload.
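
A minimal Python sketch of an HTTP heartbeat over one persistent connection follows. The /healthz path, the HealthHandler class, and the http_probe helper are assumptions made for the example; a small local server stands in for the application being checked.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical health endpoint; the /healthz path is an assumption."""
    protocol_version = "HTTP/1.1"  # lets the connection stay open between probes

    def do_GET(self):
        status = 200 if self.path == "/healthz" else 404
        body = b"ok" if status == 200 else b"not found"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def http_probe(conn, path):
    """Return True if the path answers with a 2xx status on this connection."""
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection can be reused
        return 200 <= resp.status < 300
    except (http.client.HTTPException, OSError):
        conn.close()
        return False

server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=2.0)
probe_ok = http_probe(conn, "/healthz")
probe_missing = http_probe(conn, "/missing")
probe_again = http_probe(conn, "/healthz")  # reuses the same TCP connection
conn.close()
server.shutdown()
server.server_close()
```

Reusing one HTTPConnection object is what avoids the connect and teardown cost on every probe; a timeout or connection error is treated as a failed heartbeat.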

3. Implementation details

(1) Do not implement a separate “heartbeat thread”

Using a separate thread for heartbeat detection isolates the heartbeat code from the business logic code. But when the "business thread" deadlocks or stops because of a bug, the heartbeat thread keeps responding as if nothing had happened, so the failure goes undetected.

Therefore, heartbeat detection should be implemented directly in the "business thread".
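
One way to sketch this in Python is to route heartbeat requests through the same message loop that serves business requests, so a stuck loop cannot answer them. The message names ("ping", "hang", "stop") and both function names are hypothetical, and "hang" merely simulates a deadlock.

```python
import queue
import threading

def business_worker(inbox, outbox):
    """Single loop that serves both business messages and heartbeats."""
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        if msg == "ping":
            outbox.put("pong")        # only reached while the loop makes progress
        elif msg == "hang":
            threading.Event().wait()  # simulate a deadlock: block forever
        else:
            outbox.put("done:" + msg)

def is_alive(inbox, outbox, timeout=0.5):
    """Send a heartbeat through the business queue and wait for the reply."""
    inbox.put("ping")
    try:
        return outbox.get(timeout=timeout) == "pong"
    except queue.Empty:
        return False

inbox, outbox = queue.Queue(), queue.Queue()
threading.Thread(target=business_worker, args=(inbox, outbox), daemon=True).start()

inbox.put("order-1")
first_reply = outbox.get(timeout=1)
alive_before = is_alive(inbox, outbox)
inbox.put("hang")                       # the business loop is now stuck
alive_after = is_alive(inbox, outbox)   # the heartbeat goes unanswered
```

A dedicated heartbeat thread would have kept answering after the "hang" message, which is exactly the false positive described above.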

(2) Do not implement the “heartbeat connection” separately

In network programming scenarios (TCP, for example), heartbeat detection should run directly on the "business connection" rather than on a separate connection. That way, when something goes wrong with the business connection, the other party notices promptly because the heartbeat response does not arrive in time.

In addition, many network firewalls periodically monitor and clear idle (zombie) connections. If heartbeats travel over an additional connection, a "business connection" that has had no data to send for a long time may be dropped by the firewall while the heartbeat connection keeps working normally. The other party is then misled into believing that the "business connection" is still healthy.

Therefore, heartbeat detection should be implemented directly in the "business connection".
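
The following Python sketch multiplexes heartbeats and business requests on one connection using a line-based protocol. The PING/PONG message names, the "OK" reply prefix, and the helper functions are assumptions made for the example; socket.socketpair() stands in for a real TCP connection.

```python
import socket
import threading

def serve(conn):
    """Peer that answers business requests and heartbeats on the same connection."""
    with conn, conn.makefile("rwb") as stream:
        for line in stream:
            msg = line.strip()
            reply = b"PONG" if msg == b"PING" else b"OK " + msg
            stream.write(reply + b"\n")
            stream.flush()

def request(stream, msg):
    """Send one line and return the one-line reply."""
    stream.write(msg + b"\n")
    stream.flush()
    return stream.readline().strip()

def heartbeat_ok(stream):
    """True if the peer answers a heartbeat before the socket timeout expires."""
    try:
        return request(stream, b"PING") == b"PONG"
    except OSError:  # includes socket timeouts
        return False

client, server = socket.socketpair()
threading.Thread(target=serve, args=(server,), daemon=True).start()
client.settimeout(1.0)

with client, client.makefile("rwb") as stream:
    business_reply = request(stream, b"GET balance")
    heartbeat_alive = heartbeat_ok(stream)
```

Because the heartbeat shares the connection, it also counts as traffic on that connection, so an intermediate firewall has less reason to treat it as idle.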
