How packets travel through the various layers of the TCP/IP protocol stack

All Internet services rely on the TCP/IP protocol stack. Understanding how data is transmitted in the protocol stack will help you improve the performance of Internet programs and solve TCP-related problems.

This article describes how packets move through the protocol layers on Linux.

1. Sending Data

The process of sending data at the application layer is roughly as follows:

We roughly divide the above processing areas into:

  • User Area
  • Kernel Area
  • Device Area

The tasks in the user and kernel areas are all performed by the host CPU, so these two areas are together called the host area to distinguish them from the device area, which has its own processor on the network interface card. The device is the network interface card (NIC) that sends and receives packets, also commonly known as a LAN card.

When the application calls write(fd, buf, len) to send data, execution passes from user mode into kernel mode. The link between the two is the socket file descriptor (fd) and the write system call.

The kernel socket has two buffers:

  • send socket buffer, used to send data
  • receive socket buffer, used to receive data

When the write system call is executed, the user-space data (buf, len) is copied into kernel memory and appended to the end of the send socket buffer, so that data is sent in order (see the figure below), and then TCP is called.
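The two socket buffers can be observed from user space. The following is a minimal sketch, assuming a Linux host where loopback TCP connections are allowed; the helper names loopback_tcp_pair and demo_send_buffer are ours and purely illustrative. It reads the buffer sizes with getsockopt and shows that send() returns as soon as the bytes have been copied into the kernel send buffer.

```python
import socket


def loopback_tcp_pair():
    """Create a connected TCP client/server socket pair over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def demo_send_buffer(payload=b"hello"):
    """Return (send buffer size, receive buffer size, bytes sent, bytes read)."""
    client, server = loopback_tcp_pair()
    try:
        sndbuf = client.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        rcvbuf = server.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        # send() returns once the bytes are copied into the kernel send
        # buffer; it does not wait for the peer to read them.
        sent = client.send(payload)
        received = server.recv(len(payload))
        return sndbuf, rcvbuf, sent, received
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(demo_send_buffer())
```

The sizes printed depend on the kernel configuration, so no particular values should be expected.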

TCP keeps the state of each connection in a data structure called the TCB (TCP Control Block). The TCB contains the information needed to run a TCP session, including the TCP connection status, receive window, congestion window, sequence numbers, retransmission timer, and so on.
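To make the idea of a TCB concrete, here is a small sketch of such a structure. It is only an illustration: the class and field names are hypothetical and the default values are arbitrary. The Linux kernel keeps this state in its own C structures with many more fields.

```python
from dataclasses import dataclass
from enum import Enum


class TcpState(Enum):
    CLOSED = "CLOSED"
    LISTEN = "LISTEN"
    SYN_RECV = "SYN_RECV"
    ESTABLISHED = "ESTABLISHED"
    FIN_WAIT_1 = "FIN_WAIT_1"


@dataclass
class TcpControlBlock:
    """Illustrative per-connection state; field names are hypothetical."""
    state: TcpState = TcpState.CLOSED  # TCP connection status
    rcv_wnd: int = 65535               # receive window, in bytes
    cwnd: int = 10                     # congestion window, in segments
    snd_nxt: int = 0                   # next sequence number to send
    rcv_nxt: int = 0                   # next sequence number expected
    rto_ms: int = 1000                 # retransmission timer, in milliseconds


tcb = TcpControlBlock(state=TcpState.ESTABLISHED, snd_nxt=1000)
```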

TCP creates TCP data segments, and TCP data segments include TCP header and payload, as shown below:

Payload is the data in the socket buffer to be sent, and TCP header is auxiliary information added to ensure reliable data transmission via TCP.
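The layout of a segment can be sketched by packing a minimal 20-byte TCP header in front of a payload. This is a simplified illustration, not what the kernel does: the function name build_tcp_segment and the port numbers are assumptions, there are no TCP options, and the checksum field is left at zero because the real checksum also covers a pseudo-header taken from the IP layer.

```python
import struct

# TCP flag bits as defined for the TCP header.
SYN, PSH, ACK = 0x02, 0x08, 0x10


def build_tcp_segment(src_port, dst_port, seq, ack, flags, window, payload):
    """Return header + payload; checksum and urgent pointer are left as 0."""
    data_offset = 5 << 4  # header length of 5 32-bit words, no options
    header = struct.pack(
        "!HHIIBBHHH",
        src_port, dst_port, seq, ack, data_offset, flags, window, 0, 0,
    )
    return header + payload


segment = build_tcp_segment(50000, 3306, 1, 1, PSH | ACK, 65535, b"hello")
```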

These data segments will enter the IP layer, and the IP layer will add IP header information to the data segments, as shown below:

Before executing routing, IP will check the Netfilter LOCAL_OUT hook to see if iptables related configuration needs to be executed. Then IP routing is executed. The main function of IP routing is to find the IP address of the next hop (such as a gateway or router), and the purpose of routing is to reach the machine where the destination IP address is located.

After routing, IP checks the Netfilter POST_ROUTING hook and applies any iptables rules configured there. Before handing the packet to the data link layer, the IP layer also performs ARP (Address Resolution Protocol) to find the destination MAC address for the next-hop IP address, and adds the Ethernet header to the IP packet, as shown in the figure below.
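The last step, resolving the next hop to a MAC address and prepending the Ethernet header, can be sketched as follows. The ARP table is modelled as a plain dictionary and every address in it is made up for illustration; a real ARP cache is maintained by the kernel.

```python
import struct

ETHERTYPE_IPV4 = 0x0800

# Hypothetical ARP cache: next-hop IP address -> MAC address.
ARP_TABLE = {"172.17.0.1": "02:42:ac:11:00:01"}


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6-byte form."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_frame(next_hop_ip, src_mac, ip_packet):
    """Prepend a 14-byte Ethernet header to an IP packet."""
    dst_mac = ARP_TABLE[next_hop_ip]  # the ARP lookup
    header = (
        mac_to_bytes(dst_mac)
        + mac_to_bytes(src_mac)
        + struct.pack("!H", ETHERTYPE_IPV4)
    )
    return header + ip_packet


frame = build_frame("172.17.0.1", "02:42:ac:11:00:03", b"ip-packet-bytes")
```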

The IP layer also provides a raw socket interface through which user programs can send packets. Packets sent through a raw socket do not follow the normal path described above: when Netfilter processing is performed, these hooks are skipped.

After the IP layer finishes its work, it entrusts the data packet (the data packet in the above figure is generally called a frame) to the data link layer.

Since ARP has written the destination MAC address into the packet header, this reduces the driver's work. After entering the data link layer, the kernel will detect whether there is a packet capture tool (such as tcpdump) listening to the packet. If so, the kernel will copy the packet information to the memory address space of the packet capture tool.

After that, following the protocol agreed between driver and NIC, the driver asks the NIC to transmit the packet. When the NIC receives this request, it copies the packet into its own memory and sends it onto the network. When the NIC has finished sending the packet, it raises an interrupt, and the host CPU runs the interrupt handler to complete the remaining work.

2. Receiving Data

The process of an application receiving data is roughly as follows:

First, the NIC writes the packet into its own memory and checks whether it is valid. If it is, the NIC writes the packet into the host's memory and then raises an interrupt to the host operating system, at which point processing enters the kernel area.

At the data link layer, the kernel first checks the packet, and the driver then wraps it in a form that the TCP/IP stack can understand. The packet is then dispatched to the upper layer according to the Ethertype in the Ethernet header. Assuming it is IPv4, the Ethernet header is removed and the packet is passed to the IP layer. It is worth noting that before the packet is handed to the IP layer, if a packet capture tool is listening, the kernel copies the packet to the memory address space of the packet capture tool.

The IP layer verifies the IP header by recomputing its checksum. If the header is valid, IP checks the PRE_ROUTING hook (for example, to apply any matching iptables rules) and then performs IP routing, which decides whether the packet is processed locally or forwarded to another host. A forwarded packet goes through the FORWARD and POST_ROUTING hooks and is handed down to the data link layer. For a packet processed locally, IP also checks the LOCAL_IN hook and then looks at the proto field of the IP header; assuming it is TCP, IP removes the IP header and passes the packet up to the TCP layer. It is worth noting that before the packet is handed to the TCP layer, if a raw socket is listening to capture packets, the kernel copies the packet to the memory address space of the raw socket (tcpcopy uses a raw socket to listen to IP-layer packets by default).
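The header check described above can be sketched with the standard Internet checksum: the one's complement of the one's-complement sum of the header's 16-bit words. The function names are ours, and the sample header bytes are a commonly used textbook example rather than data from the experiments in this article. A header is valid when the checksum computed over the whole header, including its checksum field, comes out as zero.

```python
import struct


def ipv4_checksum(header):
    """Internet checksum over the given bytes, as a 16-bit integer."""
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:     # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def header_is_valid(header):
    """A received header is valid if its checksum verifies to zero."""
    return ipv4_checksum(header) == 0


# 20-byte IPv4 header with the checksum field (bytes 10-11) set to zero.
unsigned = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = ipv4_checksum(unsigned)
signed = unsigned[:10] + struct.pack("!H", checksum) + unsigned[12:]
```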

The TCP layer will detect whether the data packet is valid based on the TCP checksum (if checksum offload is used, the NIC will do the relevant calculation), and then search for the corresponding TCB (TCP control block) for the data packet. The search method is to search through the following combination of information:

  <source IP, source port, destination IP, destination port>

If not found, a reset packet is generally sent; if found, the TCP packet processing phase is entered.
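The lookup can be pictured as a dictionary keyed by the four-tuple. This is a simplified sketch with hypothetical names and addresses; the kernel uses hash tables and also has to match listening sockets for new connection requests, which is omitted here.

```python
from typing import NamedTuple


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def handle_segment(tcb_table, key):
    """Return what happens to a segment that carries the given four-tuple."""
    if key in tcb_table:
        return "process segment"
    return "send RST"


known = FourTuple("172.17.0.2", 50000, "172.17.0.3", 3306)
unknown = FourTuple("172.17.0.2", 50001, "172.17.0.3", 3306)
tcb_table = {known: {"state": "ESTABLISHED"}}  # value stands in for a TCB
```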

If new data is received, TCP puts it into the socket receive buffer and then, depending on the TCP state, sends an ACK to acknowledge it if necessary. The size of the socket receive buffer determines the TCP receive window size. Up to a point, a larger receive window allows higher TCP throughput. Newer kernels can adjust the window size dynamically, so the user does not have to modify system parameters.

When a read event fires, the user application performs a read operation, and execution passes from user space into kernel space. The kernel copies the contents of the socket buffer to the memory area specified by the user and then releases the part of the socket buffer that has been read. TCP increases the size of the receive window and, if necessary, sends a window update packet to the peer TCP. For example, in the figure below, TCP sends an ACK packet to notify the peer TCP that the local TCP receive window has been updated.
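The relationship between the receive buffer and the receive window can be sketched with a toy model, in which the window is simply the free space left in the buffer. The class name and the tiny capacity are assumptions for illustration; real kernels apply additional rules when sizing and advertising the window.

```python
class ReceiveBuffer:
    """Toy socket receive buffer whose free space is the receive window."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    @property
    def window(self):
        return self.capacity - len(self.data)

    def on_segment(self, payload):
        """Store as much payload as fits; return the number of bytes accepted."""
        accepted = payload[:self.window]
        self.data += accepted
        return len(accepted)

    def read(self, n):
        """Copy up to n bytes out to the application and free that space."""
        chunk = bytes(self.data[:n])
        del self.data[:n]
        return chunk


buf = ReceiveBuffer(capacity=10)
accepted = buf.on_segment(b"abcdefgh")
window_before_read = buf.window
chunk = buf.read(5)
window_after_read = buf.window
```

After the read the window has grown again, which is the point at which a window update may be sent to the peer.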

After the read operation is completed, it returns to the application, and the application can process the data.

3. Working principle of packet capture tools

Now that we know how data is sent and received, let's analyze the principle of tcpdump packet capture.

tcpdump captures packets at the junction of the data link layer and the IP layer (the capture point belongs to the data link layer, as shown below).

When a packet reaches this junction, the kernel checks whether tcpdump is listening. If it is, the kernel puts the packet content into the buffer set up by tcpdump. In theory, as long as tcpdump reads the data in time and the traffic load is not heavy, no packets are lost from the capture.

The packets captured by tcpdump are only those that have crossed the boundary between the link layer and the network layer. A packet coming in from the network card may go on all the way up to TCP, be dropped at the IP layer, or be forwarded as a router would forward it. A packet sent from the local machine, once captured by tcpdump, has reached the data link layer and has not been filtered out by the IP layer: a packet filtered out by the IP layer never reaches the tcpdump capture point and does not appear in the capture file.
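This reasoning can be written down as a tiny model of the two paths. The stage names follow the hooks mentioned in this article, but the model itself is only a sketch: it walks a path in order and reports whether the packet reaches the capture point before it is dropped.

```python
# Stages a packet passes through, in order, for each direction.
INBOUND = ["nic", "capture", "PRE_ROUTING", "routing", "LOCAL_IN", "tcp"]
OUTBOUND = ["tcp", "LOCAL_OUT", "routing", "POST_ROUTING", "capture", "nic"]


def seen_by_capture(path, dropped_at=None):
    """Return True if the packet reaches the capture point before being dropped."""
    for stage in path:
        if stage == dropped_at:
            return False
        if stage == "capture":
            return True
    return False
```

An inbound packet dropped at LOCAL_IN has already passed the capture point, whereas an outbound packet dropped at LOCAL_OUT never reaches it.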

Below we conduct some experiments to verify the above conclusions.

Before the experiment, let's first introduce the iptables tool. iptables is a widely used firewall tool that mainly interacts with the kernel netfilter packet filtering framework.

(1) Experiment: LOCAL_IN filtering

We configure the following iptables commands on the server:

  iptables -I INPUT -p tcp --dport 3306 -s 172.17.0.2 -j QUEUE

The "-I INPUT" parameter in the command above means that the rule is evaluated at the netfilter LOCAL_IN hook, that is, before the packet reaches the server-side TCP. A packet that matches the rule is handed to the QUEUE target (where, by default, it is simply discarded) and goes no further.

See the following figure for specific command execution:

After setting the above iptables, when 172.17.0.2 accesses the 172.17.0.3 3306 service, the IP data packet (as shown in the green arrow in the figure below) will be discarded at the IP layer on the server side, and the direction indicated by the red arrow is where tcpdump captures the packet.

We start tcpdump to capture packets:

  tcpdump -i any tcp and port 3306 and host 172.17.0.2 -n -v

Use the MySQL client command on 172.17.0.2 to access the 3306 service on 172.17.0.3, as shown below:

After a long wait, the client finally reported that it could not connect.

The packet capture results on the server are as follows:

We see the first handshake packet being retransmitted over and over again.

Use the netstat command to check whether there is a corresponding TCP status. The result shows that there is none, as shown in the following figure:

There is no TCP state for the connection, which means that the packet never entered the server-side TCP: the first handshake packet was dropped at the server-side IP layer.

Use the netstat -s command to find clues in the TCP/IP statistics parameters on the server:

In the figure above, the server IP layer received 20079 data packets, and in the figure below, it received 20086 data packets. The MySQL client login process added a total of 7 data packets, which just matched the 7 first handshake data packets shown in the packet capture file.

At the server-side TCP layer, comparing the two figures above, there is no change in the data, indicating that the server-side TCP has not received any data packets.
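The before-and-after comparison can also be scripted. On Linux these counters are exposed in /proc/net/snmp as pairs of lines: one line with the field names and one with the values. The sketch below parses that format and computes the change in a counter. The sample text is abbreviated: only the two InReceives values come from this experiment, and the InSegs value is a placeholder.

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    stats = {}
    lines = [line.split() for line in text.strip().splitlines()]
    # Lines come in pairs: "Ip: name name ..." followed by "Ip: value value ...".
    for names, values in zip(lines[0::2], lines[1::2]):
        proto = names[0].rstrip(":")
        stats[proto] = dict(zip(names[1:], map(int, values[1:])))
    return stats


def counter_delta(before, after, proto, name):
    return after[proto][name] - before[proto][name]


BEFORE = """
Ip: InReceives
Ip: 20079
Tcp: InSegs
Tcp: 1000
"""
AFTER = """
Ip: InReceives
Ip: 20086
Tcp: InSegs
Tcp: 1000
"""

before, after = parse_snmp(BEFORE), parse_snmp(AFTER)
ip_delta = counter_delta(before, after, "Ip", "InReceives")
tcp_delta = counter_delta(before, after, "Tcp", "InSegs")
```

On a live system the same functions could be applied to two snapshots read from /proc/net/snmp.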

The experiment shows that if inbound packets are dropped at the IP layer on the server, nothing changes at the TCP layer on the server.

(2) Experiment: LOCAL_OUT filtering

The purpose of this experiment is to see what a packet capture shows when packets are filtered at the IP layer netfilter LOCAL_OUT hook.

As shown below:

We set the following iptables command:

  iptables -I OUTPUT -p tcp --sport 3306 -d 172.17.0.2 -j QUEUE

The specific operation is as follows:

The OUTPUT parameter in the command above means that the rule is evaluated at the netfilter LOCAL_OUT hook, that is, before IP routing. An IP packet that matches the rule is handed to the QUEUE target (where, by default, it is simply discarded) and goes no further down the stack.

Use the MySQL client command on 172.17.0.2 to access the 3306 service on 172.17.0.3, as shown below:

After a long wait, the client finally reported that it could not connect.

The packet capture results on the server are as follows:

We see that the first handshake data packet is repeatedly retransmitted, which is almost exactly the same as the previous capture result.

Use the netstat command to check whether there is a corresponding TCP status. It turns out that there is a SYN_RECV status, as shown below:

The presence of TCP status indicates that the data packet enters the server-side TCP and enters the SYN_RECV state. The server-side TCP will send a second handshake packet, but the packet capture shows that there is no second handshake packet, indicating that it has been blocked by the iptables configuration.
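The TCP state that netstat reports can also be read from /proc/net/tcp, where each line holds the local address, the remote address and a hexadecimal state code. The sketch below decodes one such line. The sample line is constructed by hand to resemble this experiment (the client port 50000 is made up), and the address decoding assumes a little-endian host such as x86.

```python
import socket
import struct

# State codes used in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def decode_addr(field):
    """Decode 'HEXIP:HEXPORT' (IPv4, little-endian host) into (ip, port)."""
    ip_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_tcp_line(line):
    """Return (local address, remote address, state name) for one entry."""
    parts = line.split()
    state = TCP_STATES.get(parts[3].upper(), "UNKNOWN")
    return decode_addr(parts[1]), decode_addr(parts[2]), state


SAMPLE = "   0: 030011AC:0CEA 020011AC:C350 03 00000000:00000000"
entry = parse_tcp_line(SAMPLE)
```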

View the netstat -s results:

The upper figure shows the values before the experiment, and the lower figure shows the values after the experiment.

From the TCP level information, 17 data segments were sent, indicating that the server-side TCP sent the second handshake data packet and sent it many times. However, because iptables was set, these data packets were intercepted and could not reach the data link layer, and could not be captured by tcpdump.

In both experiments tcpdump captures the same thing: the first handshake packet being retransmitted again and again. What differs is where the iptables rule sits. In the first experiment it is on the ingress path, and the server-side TCP layer holds no state for the connection; in the second it is on the egress path, and the server-side TCP layer does hold state.

Further analysis can proceed in the following two directions:

  • Distinguish the two situations by checking the TCP state
  • Compare the TCP/IP statistics reported by netstat -s
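The two checks can be combined into a simple decision helper. This is only a sketch of the reasoning in these two experiments, for the case where the capture shows nothing but retransmitted first handshake packets; the function name, its inputs and its return values are hypothetical.

```python
def locate_drop(server_tcp_state, server_out_segs_delta):
    """Guess on which side of the server's TCP layer packets are dropped.

    server_tcp_state: state reported by netstat for the connection,
        or None if there is no entry.
    server_out_segs_delta: increase in the number of segments sent by the
        server-side TCP during the test, from netstat -s.
    """
    if server_tcp_state is None and server_out_segs_delta == 0:
        return "ingress"   # dropped before reaching TCP, as at LOCAL_IN
    if server_tcp_state == "SYN_RECV" and server_out_segs_delta > 0:
        return "egress"    # TCP replied, but replies were dropped, as at LOCAL_OUT
    return "inconclusive"
```

Other combinations are reported as inconclusive, because the two experiments alone do not tell us how to interpret them.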

Through the above experiment, we can see that tcpdump packet capture only observes the world from one point and cannot see the whole picture. At this time, reasoning is needed to assist in solving the problem.

4. Potential protocol layer interference

(1) Receiving data

The following figure shows the process of a data packet from the NIC to the protocol stack and then to the application.

TCP offload is performed by the NIC to reduce the workload of TCP, but it has potential pitfalls. At the data link layer there is a capture interface for packet capture tools such as tcpdump, as well as a raw socket capture interface. At the network layer there is a raw socket capture interface, the IP forwarding function, and the whole Netfilter framework (where there are many pitfalls). The TCP layer is relatively quiet and subject to less interference. User programs retrieve data from TCP or obtain new connections through the socket interface.

(2) Sending data

The following diagram shows the process of sending data packets from the application to the NIC.

User programs use the socket interface to ask TCP to send data or establish connections. At the network layer there is a raw socket sending interface and the whole Netfilter framework (where there are many pitfalls). At the data link layer there is a pcap sending interface and a raw socket sending interface. TCP offload is performed by the NIC to reduce the workload of TCP (for example, segmentation and checksum calculation). We have also encountered packet loss caused by improper TCP offload.

(3) Case

Here is an example of receiving a packet from the NIC, all the way to the application, and then sending a response:

Our application is Nginx (web server software), where Nginx is configured to listen on port 8080 and has access log enabled.

The figure above sets nginx keepalive_timeout to 0, which means that idle client connections are not kept open: nginx closes the connection after responding (this makes the experiment easier).

Start nginx and check through netstat that nginx is already listening for connection requests on port 8080.

At the beginning, there was no access to nginx, the access log was empty, and iptables was not set up.

On the 172.17.0.2 machine, use telnet to access the 8080 port service on 172.17.0.3, as shown below:

In this way, telnet establishes a connection with nginx. The figure below shows that the corresponding connection on the server side has entered the ESTABLISHED state.

After the connection is established, we set the iptables command, as shown in the figure below, to intercept and discard the nginx response returned to 172.17.0.2.

We continue to execute the telnet command on the client (172.17.0.2), type "GET hello.html", and then press Enter to execute.

From the nginx log, we can see that this request has been processed. Although it is an invalid request, it is confirmed to have reached nginx.

About 2 minutes later, we checked the packet capture on the client and found that a total of 16 packets had been captured. The client also showed the connection in the ESTABLISHED state.

We then checked the server side and could not find the corresponding connection with netstat, which means that the connection no longer exists in the server-side TCP layer.

We analyze the packet capture (the captures on the server and on the client show the same thing):

Since sending the request packet, the client has been retransmitting the request packet because it has not seen any data packets from the server. The client thinks that the server has not received the request, but in fact the request has been processed by nginx.

Check the statistics of netstat -st on the server.

The above picture shows the status before executing the telnet request, and the following picture shows the status after executing the telnet request.

From the figures above, we can see that the "connection aborted due to timeout" counter has increased by one. From the point of view of TCP on the server side, the response packet for the request (which carries the FIN flag that closes the connection) could not be sent out, so the connection was aborted. At this point the corresponding connection can no longer be seen on the server side.

From the point of view of nginx in the upper layer, it received an invalid request, so it sent a response and closed the connection. From the point of view of the TCP layer, the packet carrying the FIN cannot reach the tcpdump capture interface, so the server-side TCP stays in the FIN_WAIT_1 state (described in detail in "What to do when encountering a large number of FIN_WAIT1?") for a while and keeps retransmitting. Because the retransmissions never receive a response, TCP eventually moves from FIN_WAIT_1 to CLOSED, and the connection can no longer be found on the server side.

In this case, we know in advance that we have set up iptables, but if we don’t know, how can we determine where the problem lies?

Relying on tcpdump alone is clearly not enough, because packet capture analysis by itself would only suggest that the server has not received the request. We also need information from the server to make further judgments. The nginx log shows that the request was processed by the application layer, so the request packet reached nginx, nginx processed it and responded, and then handed the response packets to the server-side TCP to send. Since the responses sent by the server-side TCP never reached the packet capture interface, they must have been dropped at the IP layer. With this information we can examine the netfilter configuration for outgoing packets to see whether anything filters these responses.

This case shows that tcpdump alone is not enough. You also need to combine information from several sources and reason about it to work out where the problem lies. Without this knowledge, you would wrongly conclude from the client side that the server never received the request.

5. Cross-machine judgment

During cross-machine access, there are the following potential interferences (pitfalls):

  • The machine's own IP layer security filtering
  • Packet loss in the link-layer send queue
  • Potential issues with TCP offload at the link layer (the NIC is treated as part of the data link layer here)
  • Various problems with intermediate equipment (routers, switches, firewalls, gateways, load balancers, etc.)
  • Packet loss in the peer machine's link-layer receive queue
  • Potential issues with TCP offload (NIC) at the peer link layer
  • Peer IP layer security filtering
  • The peer TCP abnormal state interference

These issues will be introduced in TCPCopy and other chapters and will not be described in detail here.

6. Analysis of common tool work levels

The figure above shows the working levels of some popular tools. For example, tcpcopy works at layer 4 by default and calls the raw socket interface provided by the IP layer to capture and send packets. The netstat or ss tool can be used to obtain various TCP/IP statistics. LVS works at layer 4 and uses Netfilter to forcibly change the route. tcpdump works at the data link layer. HTTP applications work at the application layer.

Understanding these working principles can help you understand the problems more deeply and solve various TCP-related problems.