Data packet sending process

First, the green chat software clients on our two phones need to communicate through their server. It looks like this.

Figure: Chat software three-way communication

But to simplify the model, we leave out the server in the middle and treat this as end-to-end communication. And to make sure messages are delivered reliably, let's just assume the two ends talk to each other over TCP.

Figure: Chat software communication between both ends

To send data packets, the two ends first establish a TCP connection through a three-way handshake. When a message is sent from the chat box, it is copied from the user space where the chat software lives into the send buffer in kernel space. The packet then passes through the transport layer and the network layer and enters the data link layer, where it goes through flow control (qdisc) before being handed via the RingBuffer to the network card at the physical layer. From there the data is sent out into the complex world of the network, hopping across routers and switches until it finally reaches the network card of the destination machine. That network card writes the packet into its RingBuffer via DMA and then raises a hard interrupt to the CPU. The CPU triggers a soft interrupt, and ksoftirqd goes to the RingBuffer to collect the packet. So on the receiving side the packet climbs up through the physical layer, data link layer, network layer and transport layer, and is finally copied from kernel space into the chat software in user space.

Figure: Panoramic view of network packet sending and receiving
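If you want to peek at the receiving half of this machinery on a Linux box, a couple of read-only commands give a rough view of it. Treat this as a sketch: the file paths are standard on Linux, but the exact columns and their meanings vary a little between kernel versions.

# cat /proc/softirqs | grep -E 'NET_RX|NET_TX'
Shows how many receive (NET_RX) and send (NET_TX) soft interrupts each CPU has handled, i.e. how busy ksoftirqd has been.

# cat /proc/net/softnet_stat
Per-CPU counters for the soft-interrupt packet processing path; on common kernels the second column counts packets dropped because the per-CPU backlog was full.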
At this point, leaving the details aside, you should have a rough picture of the macro journey a data packet takes from sending to receiving. As you can see, it is full of moving parts, and packet loss can happen at many points along the link. But to keep you from squatting on the toilet so long that it harms your health, I will only focus on a few common scenarios where packet loss occurs.

Packet loss when establishing a connection

The TCP protocol establishes a connection through a three-way handshake. It looks like this.

Figure: TCP three-way handshake

On the server side, after the first handshake a half-open connection is created and the second handshake is sent back. These half-open connections need to be parked somewhere temporarily, and that place is called the half-connection queue (the SYN queue). If the third handshake arrives later, the half-open connection is promoted to a full connection and moved to another place, the full connection queue (the accept queue), where it waits for the program to call accept() and take it away.

Figure: Half-connection queue and full connection queue

Queues have a length, and anything with a length can fill up. Once a queue is full, newly arriving packets are simply discarded, and from the application's point of view the symptom is that connection establishment fails. You can check whether this kind of packet loss is happening with the counters shown below.
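One common way to check is through the protocol counters the kernel keeps. This is a sketch rather than gospel, since the exact wording of the counters differs slightly between kernel versions.

# netstat -s | grep -i "overflowed"
The "times the listen queue of a socket overflowed" counter is the number of times the full connection queue overflowed.

# netstat -s | grep -i "SYNs to LISTEN"
Counts SYN packets dropped on listening sockets, which usually points at a full half-connection queue.

# ss -lnt
For sockets in the LISTEN state, Send-Q is the maximum length of the full connection queue and Recv-Q is how much of it is currently occupied.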
Flow control packet loss

There is plenty of software at the application layer that can send network packets. If all of that data rushed straight at the network card without any control, the card could not keep up. So the data is queued according to certain rules and processed in turn, which is what qdisc (Queueing Disciplines) does; this is the flow control mechanism we usually talk about. To queue, you first need a queue, and a queue has a length. In the output of the ifconfig command, the number 1000 after txqueuelen is exactly the length of this flow control queue. When data is sent too fast and txqueuelen is not large enough, packet loss can easily occur.

Figure: qdisc packet loss

You can use ifconfig to look at the dropped field under TX; when it is greater than 0, flow control packet loss may have occurred.

# ifconfig eth0

When you run into this, you can try increasing the length of the flow control queue. For example, to raise the queue length of the eth0 network card from 1000 to 1500:

# ifconfig eth0 txqueuelen 1500

NIC packet loss

The network card and its driver are also frequent causes of packet loss, for many reasons, such as a poor quality network cable or a bad connection. Beyond that, let's look at a few common scenarios.

RingBuffer is too small, causing packet loss

As mentioned above, received data is staged in the RingBuffer and then collected after the kernel triggers a soft interrupt. If this buffer is too small and data arrives too fast, it can overflow, and packets are lost.

Figure: Packet loss due to RingBuffer being full

You can use ifconfig to check whether this has happened.

# ifconfig

Look at the overruns metric, which records the number of overflows caused by the RingBuffer being too short. You can also see this with ethtool.

# ethtool -S eth0 | grep rx_queue_0_drops

Note that a network card can have multiple RingBuffers, and the 0 in rx_queue_0_drops refers to drops in the 0th RingBuffer. For a multi-queue network card, change the 0 to other queue numbers. But my family's financial situation does not allow me to see the drop counts of other queues, so the command above is enough for me. When you find this type of packet loss, you can check the current RingBuffer configuration of the card.

# ethtool -g eth0

On my card the output shows that the maximum supported RingBuffer length is 4096, but only 1024 is actually in use. To change this, you can run ethtool -G eth0 rx 4096 tx 4096 to grow both the receive and send RingBuffers to 4096. With a larger RingBuffer, packet loss caused by its small capacity can be reduced.

Insufficient network card performance

As a piece of hardware, the network card has an upper limit on its transmission speed. When traffic pushes up against that limit, packet loss begins. This situation is common in stress testing scenarios. You can use ethtool plus the interface name to see the maximum speed the card supports.

# ethtool eth0

The card I use reports a maximum speed of 1000Mb/s, the familiar gigabit card. Note that the unit is Mb, and the b means bit, not byte; 1 Byte = 8 bit. So 1000Mb/s divided by 8 gives a theoretical maximum transfer rate of about 125MB/s. We can use the sar command to look at sends and receives at the network interface level.

# sar -n DEV 1

txkB/s is the number of kilobytes sent per second and rxkB/s is the number of kilobytes received per second. When the two together add up to roughly 120,000 to 130,000 kB/s, you are at about 125MB/s, the performance ceiling of the card, and packet loss will begin. If you run into this, first check whether your service really carries that much genuine traffic. If it does, consider splitting the service, or just bear the pain and spend the money to upgrade the hardware.

Receive buffer packet loss

When we do network programming with a TCP socket, the kernel allocates a send buffer and a receive buffer for it. When we want to send a packet and call send(msg) in code, the packet does not fly straight out through the network card. Instead, the data is copied into the kernel send buffer and the call returns; when to actually send, and how much, is decided by the kernel later.

Figure: tcp_sendmsg logic

The receive buffer plays a similar role on the other side: packets arriving from the network are staged there, waiting for the user space application to read them. Both buffers have size limits, which can be checked with the kernel parameters shown below. For both the receive buffer and the send buffer you will see three values, corresponding to the minimum, default and maximum size of the buffer (min, default, max); the buffer dynamically adjusts between min and max.
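For example, on most Linux distributions you can read these limits straight from /proc. A sketch only; the paths are standard, but the numbers are simply whatever your machine happens to be configured with.

# cat /proc/sys/net/ipv4/tcp_rmem
Prints three numbers: the min, default and max size of the TCP receive buffer, in bytes.

# cat /proc/sys/net/ipv4/tcp_wmem
Prints three numbers: the min, default and max size of the TCP send buffer, in bytes.

The same values can also be read or changed through sysctl, for example sysctl net.ipv4.tcp_rmem.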
So the question is, what happens if these buffers are set too small?

For the send buffer, when you execute send: if it is a blocking call, it simply waits until there is room in the buffer before sending.

Figure: send blocking

If it is a non-blocking call, an EAGAIN error is returned immediately, which means "try again"; the application just retries later. In either case, packet loss generally does not occur here.

Figure: send non-blocking

When the receive buffer fills up, things are different. Its TCP receive window drops to 0, the so-called zero window, and it tells the sender through win=0 in the packet header: "I can't take any more, stop sending." Normally the sender should then stop, but if data keeps arriving at this point, it gets dropped.

Figure: recv_buffer packet loss

You can use the TCPRcvQDrop counter in the following file to check whether this kind of packet loss has occurred.

# cat /proc/net/netstat

Sadly, you often will not see TCPRcvQDrop at all, because it was only introduced in kernel 5.9, while our servers usually run 2.x to 3.x kernels. You can check which Linux kernel version you are using with the following command.

# cat /proc/version

Network packet loss between the two ends

What we have covered so far is packet loss inside the machines at the two ends. Beyond that, the long link between them crosses the public network, with all kinds of routers, switches and optical cables, and packet loss there is also very common. Those drops happen on machines in the middle of the link, which we obviously have no permission to log in to, but we can still use a few commands to observe the connectivity of the whole path.

Ping command to check packet loss

Say the destination domain is baidu.com. If you want to know whether there is packet loss between your machine and the Baidu servers, you can use the ping command.

Figure: Ping to check packet loss

The second-to-last line of the output contains the packet loss rate; 100% packet loss means every packet was lost. But ping only tells you whether packets are lost somewhere between your machine and the destination. If you want to know which node along the path is losing them, is there a way? There is.

mtr command

The mtr command shows the packet loss at every node between your machine and the destination. Run it as shown below.

Figure: mtr_icmp

The -r option stands for report; it prints the results as a report. The Host column lists the machine at each hop of the path, and the Loss column is the packet loss rate at that hop. Note that some of the hosts in the middle show up as ???. That is because mtr uses ICMP packets by default and some nodes restrict ICMP, so they cannot be displayed. If you add -u to the mtr command, it uses UDP packets instead, and some of those ??? entries will then show their IPs.

Figure: mtr-udp

Putting the ICMP results and the UDP results together gives you a relatively complete picture of the path. One more small detail about the Loss column: in the ICMP case, look at the last row first. If the last row is 0%, it does not matter whether earlier rows show 100% or 80% loss; those are false alarms caused by node restrictions. But if the last row is around 20%, and the rows just above it also hover around 20%, then the loss most likely starts at the hop closest to where those numbers begin, and if it stays like that for a long time, that hop is probably the one with the problem.
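As a concrete, purely illustrative example of these commands, with baidu.com standing in for whatever destination you actually care about:

# ping -c 20 baidu.com
Sends 20 probes; the summary at the end reports the packet loss percentage.

# mtr -r baidu.com
Probes every hop on the path with ICMP and prints a per-hop loss report.

# mtr -r -u baidu.com
The same report, but using UDP probes, which some intermediate nodes are less likely to filter.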
If it is your company's intranet, you can take this clue to the network colleagues responsible for it. If it is the public internet, then be patient and wait; other developers are probably even more anxious about it than you are.

What to do if packet loss occurs

Having said all this, the point is simply that packet loss is very common and almost unavoidable. So the real question is: what should we do when it happens? That part is easy: use the TCP protocol for the transfer.

Figure: What is TCP

After a TCP connection is established, the sender waits for the receiver to reply with an ACK packet after sending data; the ACK's job is to tell the other side that the data was received. If a packet is lost somewhere in the middle of the link, the sender does not receive the ACK for a while and simply retransmits the data. This ensures that every data packet really does reach the receiving end.

Suppose the network goes down while we are still sending messages with the chat software. The software will keep retrying the transmission over TCP. If the network recovers during those retries, the data is sent normally. But if it still fails after multiple retries, all the way to the timeout, you get the red exclamation mark.
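By the way, if you are curious whether retransmission is actually happening on one of your Linux machines, a couple of standard counters will tell you. Again just a sketch, and a small non-zero number is normal on any busy box; it only becomes interesting when it grows quickly.

# netstat -s | grep -i retrans
Shows host-wide cumulative retransmission counters, such as the total number of segments retransmitted.

# ss -ti
Prints per-connection TCP internals; connections that are struggling show retrans counters and inflated rto values.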
Now the problem comes back. Assuming the green chat software uses the TCP protocol, why did the girl mentioned at the beginning of the article lose the packet when her boyfriend replied to her message? After all, if a packet is lost it gets retransmitted, and if the retries fail a red exclamation mark appears. So the question becomes: if we use the TCP protocol, does that mean there is no packet loss at all?

Will there be no packet loss if the TCP protocol is used?

We know that TCP sits at the transport layer, and above it live all kinds of application layer protocols, such as the common HTTP or the various RPC protocols.

Figure: Four-layer network protocol

The reliability TCP guarantees is reliability at the transport layer. In other words, TCP only guarantees that data is reliably delivered from the transport layer of machine A to the transport layer of machine B. Whether the data makes it from the receiving end's transport layer up to its application layer is not TCP's concern.

Suppose we type a message and send it from the chat box. It goes into the send buffer of the transport layer's TCP protocol. No matter what gets lost along the way, retransmission guarantees it reaches the other side's TCP receive buffer. At that point the receiving end replies with an ACK, and once the sender receives that ACK it discards the message from its own send buffer. TCP's job ends there.

TCP's job is done, but the chat software's job is not: it still has to read the data out of the TCP receive buffer. If, at exactly that moment, the phone's software crashes because it runs out of memory or for some other reason, then the sender believes the message was delivered while the receiver never actually got it. The message is lost.

Figure: Packet loss occurs when using the TCP protocol

The probability is small, but it happens, and it is perfectly logical. So from here I firmly concluded that my reader had indeed replied to the girl's message; the girl simply never received it because of packet loss, and the packet was lost because her phone's chat software crashed at the very moment the message arrived.

At this point the girl realized she had wronged her boyfriend, and tearfully said she would make him buy her the newest iPhone, one that would not crash. Uh. Brothers, if you think I did the right thing, please send some "positive energy" in the comment section.

How to solve this kind of packet loss problem?

The story ends here. While we are all still moved, let's talk about something from the heart. Everything I said above is true, not a single word is false. But how could a certain green chat software, as mature as it is, not have considered this? You may remember that at the beginning of the article we dropped the server side for simplicity and turned three-party communication into two-party communication, and that is exactly why this packet loss problem exists. Now let's add the server back.

Figure: Chat software three-way communication

Have you noticed that sometimes, after chatting a lot on your phone, you log in to the desktop client and it syncs the latest chat history over? In other words, the server keeps a record of the data we have sent recently. Assuming every message has an ID, the server and the chat software can compare the ID of the latest message on each side, much like reconciling accounts, to know whether the messages on the two ends are consistent. For the sender, as long as it reconciles with the server regularly, it knows which message failed to go through and can simply resend it. If the receiver's chat software crashes, then after restarting it has a quick exchange with the server, finds out which data it is missing, and syncs it down, so the packet loss described above never shows up.

As you can see, TCP only guarantees the reliability of messages at the transport layer, not at the application layer. If we also want messages to be reliable at the application layer, the application layer has to implement that logic itself. So the question is, if the two ends can reconcile with each other while communicating, why introduce a third party, the server, at all? There are three main reasons.
So by now you should understand that I did not remove the server merely for the sake of simplicity.

Summary
Finally, I leave you with a question: how does the mtr command know the IP address of each hop?