A 1024-byte bug: TCP data packets are so annoying!

1. Background

Recently, I encountered a particularly strange problem in the pre-release environment. The situation is roughly like this:

During production, each device goes through a registration process in which it communicates with the server over TCP to obtain configuration files, send keys, and perform other operations. The production progress got stuck at 70%.

The process is shown in the figure below.

You don't need to study the whole flow in detail; just focus on the D4 and D5 stages.

Data communication method: TCP.

picture

The configuration file looks like this, stored in key=value format.

 name=rabbit
 B2=asdf21
 ...

When the name field in the configuration file is rabbit, production succeeds. When the name field is rabbit-TD, production fails and the progress gets stuck at 70%.

From the symptom alone, it was not clear whether the device never executed the D5 stage or the server failed to process it.

2. Troubleshooting process

2.1. Check the code

Check the device and server code to see whether there is any restriction on the length of the name field.

Conclusion: The device and the server do not limit the field length of the configuration file.

2.2. View the server log

After checking the server logs, we found that only the business logs of the D4 phase were printed, but not the logs of the D5 phase.

Preliminary conclusion: The device does not send data packets in the D5 phase.

2.3. Server-side packet capture

Idea: Capture a packet to see if the server has received the data packet in the D5 phase.

On the server side, use the Microsoft Network Monitor packet capture tool to capture packets, and then put the captured packet files into Wireshark for investigation.

The figure below shows the TCP communication data between the device and the server.

picture

You can see that the device sent the configuration file to the server (D4 phase), and the server sent an ACK response.

In TCP (Transmission Control Protocol) communication, when a client sends a segment to a server, the server normally replies with an ACK (acknowledgment) to indicate that the data was received. This is part of TCP's reliable transmission mechanism, which ensures that data is delivered correctly from the sender to the receiver.

TCP uses sequence numbers and acknowledgment numbers to achieve reliable transmission. The sender assigns a sequence number to each byte sent, and the receiver sends an ACK after receiving the data. The acknowledgment number is the sequence number of the next byte the receiver expects. If the sender does not receive the ACK within a certain period of time, it resends the data. (From AI)
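The relationship between sequence and acknowledgment numbers can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name is made up for this example, and SYN/FIN flags (which also consume a sequence number) are ignored.

```python
def expected_ack(seq: int, payload_len: int) -> int:
    """Acknowledgment number a receiver returns for one data segment.

    seq is the sequence number of the first payload byte. Sequence
    numbers are 32-bit values, so the result wraps around at 2**32.
    """
    return (seq + payload_len) % 2**32


# A segment starting at sequence number 1000 that carries 1019 bytes
# is acknowledged with 2019: the next byte the receiver expects.
ack = expected_ack(1000, 1019)
```

Note that an ACK only confirms that the receiver's TCP stack got the bytes. It says nothing about whether the application has parsed them correctly.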

Preliminary conclusion: The server sent an ACK response in phase D4. The device did not send a data packet in phase D5.

Note: This conclusion was overturned during the subsequent investigation.

2.4. Capture packets on the device

Idea: Capture packets on the device to see whether the device actually sent a data packet in the D5 phase. Use the following command to capture packets on the device:

 #tcpdump -i fetho host 192.168.1.253

The captured data packets are as follows:

picture

The capture in the figure above shows that in the last step the D4 and D5 data are actually merged and sent together (I only discovered this later, and it is also the source of the 1024-byte bug).

In other words, D4 and D5 are actually sent as one stage, not separately.

Then the device waits for the server to return the configuration file (P6 stage).

Preliminary conclusion: The device executed the D5 stage, but the server did not execute the P6 stage. There is a problem with the server.

2.5. Check the data packet on the server again

This is embarrassing. The device has clearly executed the D5 stage, but the server does not seem to have received the D5 data packet.

Looking at the last data packet again, the content of the message is shown in the following figure:

picture

Opening the D4 data packet shows that it contains both the configuration file content of the D4 stage and the content of the D5 stage. I was confused when I saw this packet:

The interface documentation I had read earlier says the data is sent separately in the D4 and D5 phases. So why are they sent together?

Reason: The device writes the D4 and D5 messages to the socket back to back. TCP is a byte stream and does not preserve message boundaries, so consecutive writes can be carried in a single segment.
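A minimal Python sketch of this behavior follows. It uses a local `socket.socketpair()` as a stand-in for the real device-to-server TCP connection, and the payloads and function name are placeholders, not the actual production data.

```python
import socket


def send_back_to_back(d4: bytes, d5: bytes) -> list[bytes]:
    """Write two messages back to back and return the chunks the peer reads."""
    sender, receiver = socket.socketpair()  # local stream socket pair
    try:
        sender.sendall(d4)
        sender.sendall(d5)
        sender.close()  # EOF tells the reader that nothing more is coming
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return chunks
    finally:
        sender.close()
        receiver.close()


chunks = send_back_to_back(b"D4-config", b"D5-key")
# The reader sees one continuous byte stream. How it is cut into chunks
# is not guaranteed and need not match the two sendall() calls.
stream = b"".join(chunks)
```

The only thing the receiver can rely on is the order of the bytes, which is why the application protocol needs its own framing (start marker and length field).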

Preliminary conclusion: The server did not correctly process the combined data packet of D4 and D5.

What can we do? We can only add more logs on the server to see why the D5 data packets are not processed correctly.

2.6. Analyze the data packet

2.6.1. Message when name=rabbit (can be produced normally)

Each stage sends a message in the following format: 0x1234abcd, length, type, data.

  • 0x1234abcd : starting data
  • length: business data length
  • type: request type
  • data: business data
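As a rough illustration of this format, the following Python sketch builds one message. It assumes big-endian byte order (consistent with the length bytes shown in the captures, e.g. 0x00 0x00 0x03 0xF6 for 1014); the function name and the request type value 1 are hypothetical.

```python
import struct

START = 0x1234ABCD  # 4-byte starting data from the message format


def encode_frame(msg_type: int, data: bytes) -> bytes:
    """Build start(4) + length(4) + type(1) + data, all big-endian."""
    return struct.pack(">IIB", START, len(data), msg_type) + data


# A 1014-byte payload produces the length bytes 00 00 03 f6.
frame = encode_frame(1, b"x" * 1014)
length_bytes = frame[4:8]
```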

When the name field in the configuration file is rabbit, the combined message content of messages D4 and D5 is as follows:

picture

Notes:

  • The length field must equal the length of the business data that follows it (the configuration file data in the D4 stage, the key data in the D5 stage). Otherwise the server reports an error, which turns out to be the root cause of the D5 stage failing.
  • The data length of the configuration file in stage D4 is 0x00 0x00 0x03 0xF3, which is 1011 in decimal.
  • When the server reads the D4 message, it first reads the 4-byte data length, then the 1-byte request type, and finally exactly 1011 bytes of data. If the business data is not 1011 bytes long, an error is reported.
  • The D4 stage therefore reads 4 + 1 + 1011 = 1016 bytes in total, and then the D4 logic is executed.
  • Next, the server reads the 4-byte starting data of the D5 message, followed by the 4-byte business data length (0x00 0x00 0x01 0x00, which is 256 in decimal). At this point 1016 + 4 + 4 = 1024 bytes have been read, exactly the maximum of 1024 bytes the server reads at a time, so the rest is left for the next read. As shown in the figure below, the length field is read completely.

picture

  • The server then reads the 1-byte request type, and finally the 256 bytes of key data.
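The byte counting above can be reproduced with a small Python sketch. It is only a model of the walkthrough, not the server's code: the function and field names are made up, and, as in the walkthrough, counting starts at the D4 length field (the D4 starting data is not counted).

```python
READ_LIMIT = 1024  # bytes the server consumes per read, per the analysis


def field_split_by_limit(config_len: int, key_len: int = 256):
    """Return (field, bytes_before_limit) for the field that the
    1024-byte limit cuts through, or None if the limit falls exactly
    between two fields."""
    fields = [
        ("D4 length", 4), ("D4 type", 1), ("D4 data", config_len),
        ("D5 start", 4), ("D5 length", 4), ("D5 type", 1), ("D5 data", key_len),
    ]
    offset = 0
    for name, size in fields:
        if offset < READ_LIMIT < offset + size:
            return name, READ_LIMIT - offset
        offset += size
    return None


rabbit = field_split_by_limit(1011)     # None: boundary falls after the D5 length
rabbit_td = field_split_by_limit(1014)  # ("D5 length", 1): length field is cut
```

With a 1011-byte configuration file the boundary lands exactly at the end of the D5 length field, so no field is cut. With 1014 bytes (the rabbit-TD case), only one byte of the D5 length field fits before the boundary.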

2.6.2. Message when name=rabbit-TD (cannot be produced normally)

When the name field in the configuration file is rabbit-TD, the combined message content of messages D4 and D5 is as follows:

picture

Notes:

  • The data length of the configuration file in stage D4 is 0x00 0x00 0x03 0xF6, which is 1014 in decimal.
  • When the server reads the D4 message, it first reads the 4-byte data length, then the 1-byte request type, and finally 1014 bytes of data, for 4 + 1 + 1014 = 1019 bytes in total. Then the D4 logic is executed. Up to this point there is no problem.
  • Next it reads the 4-byte starting data of the D5 message; 1023 bytes have now been read.
  • Then it starts reading the business data length. After just 1 byte, the read reaches the server's 1024-byte maximum, and the remaining 3 bytes are left for the next read. This is where the problem arises: the length field is split across two reads!

The contents of the log are as follows:

picture

  • On the next read, the server reads 4 fresh bytes as the business data length, ignoring the byte it already consumed. Since the first byte of the length field has already been read, these 4 bytes are the remaining 3 bytes of the length field plus the byte after it, so the parse is misaligned by one byte.
  • The D5 business data length should be 256 (0x00 0x00 0x01 0x00). Because of the misalignment, the request type byte is read as the last byte of the length, giving 0x00 0x01 0x00 0x02, which is 65538 in decimal. The actual D5 business data is only 256 bytes, so the declared length no longer matches the length of the data that follows, and the server fails to parse the D5 message. As shown in the figure below:

picture
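The misread value is easy to verify. The sketch below takes the D5 length bytes and the request type byte (0x02, as implied by the misread value above) and decodes the length once at the correct offset and once shifted by one byte. Big-endian order is assumed, matching the bytes in the capture.

```python
length_field = bytes([0x00, 0x00, 0x01, 0x00])  # D5 business data length = 256
type_byte = bytes([0x02])                       # D5 request type byte

stream = length_field + type_byte

# Correct parse: all 4 bytes of the length field.
correct = int.from_bytes(stream[0:4], "big")

# Misaligned parse: the first length byte was consumed by the previous
# read, so the next 4 bytes are 3 length bytes plus the type byte.
misread = int.from_bytes(stream[1:5], "big")
```

Here `correct` is 256 and `misread` is 0x00010002, that is 65538.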

The log content is as follows:

picture

  • Combined with the above description, here is a complete message data diagram:

2.7. The truth comes out

Because the server reads at most 1024 bytes at a time, the 4-byte business data length field of D5 is cut in two. The first 1024 bytes contain the first byte of the length field; on the next read, the remaining 3 bytes of the length field plus the 1-byte request type are taken as the 4-byte length. In other words, the request type byte is misread as the last byte of the length. The resulting length, 65538, does not match the 256 bytes of business data that follow, so the server reports an error and the subsequent code is never executed.

3. Solution

3.1. Solution 1

The root cause is that the 1 byte of the length field read in the first pass is not combined with the 3 bytes read in the second pass. The fix is to make sure that the byte already read is kept and used when the rest of the length field is read in the second pass.
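One common way to do this is to accumulate incoming bytes in a buffer and only parse a message once it is complete. The Python sketch below illustrates the idea under the message format described earlier; it is not the server's actual code, and the class name, the request type values 1 and 2, and the big-endian byte order are assumptions.

```python
import struct

START = bytes.fromhex("1234abcd")
HEADER = 9  # start(4) + length(4) + type(1)


class FrameDecoder:
    """Accumulates bytes across reads and returns only complete frames."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list[tuple[int, bytes]]:
        self._buf += chunk
        frames = []
        while len(self._buf) >= HEADER:
            if bytes(self._buf[:4]) != START:
                raise ValueError("stream out of sync: bad starting data")
            length, msg_type = struct.unpack(">IB", self._buf[4:HEADER])
            if len(self._buf) < HEADER + length:
                break  # body incomplete: keep the bytes for the next read
            frames.append((msg_type, bytes(self._buf[HEADER:HEADER + length])))
            del self._buf[:HEADER + length]
        return frames


def frame(msg_type: int, data: bytes) -> bytes:
    return START + struct.pack(">IB", len(data), msg_type) + data


# Cut the stream one byte into the D5 length field, as in the bug.
stream = frame(1, b"c" * 1014) + frame(2, b"k" * 256)
cut = (HEADER + 1014) + 4 + 1
decoder = FrameDecoder()
first = decoder.feed(stream[:cut])   # only the D4 frame is complete
second = decoder.feed(stream[cut:])  # the D5 frame completes here
```

Because a partial header stays in the buffer, it no longer matters where a read boundary falls. If the server is built on Apache MINA, its `CumulativeProtocolDecoder` base class is designed for this kind of accumulation.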

3.2. Solution 2

There is another way to avoid the bug: pad the D4 configuration file slightly so that its length is exactly 1014 + 1 = 1015 bytes, or at least 1014 + 5 = 1019 bytes. With 1015 bytes, the first 1024 bytes end right after the D5 starting data, so all four bytes of the length field fall after the 1024-byte boundary. With 1019 bytes or more, the D5 starting data itself falls after the boundary.

Two cases were verified: Rabbit-TDDDDDDD and Rabbit-TDD both produced normally. The following shows the Rabbit-TDD case, in which the D4 data plus the D5 starting data fill exactly 1024 bytes.

As shown in the following figure:

  • The left side shows the Rabbit-TD log, where the system reports an error: 1023 + 1 = 1024, that is, 1014 + 5 + 5 = 1024.
  • The right side shows the Rabbit-TDD log, where everything runs normally: 1020 + 4 = 1024, that is, 1015 + 5 + 4 = 1024.

picture

To recap, either of the following layouts lets the system run normally.

1024 bytes = 1015 (configuration file data) + 4 (configuration file data length) + 1 (request type) + 4 (D5 starting data).

or

1024 bytes = 1019 (configuration file data) + 4 (configuration file data length) + 1 (request type).

Two questions remain:

- Why is the starting data of the D4 message not counted in the 1024 bytes? I also don't understand how the socket data is split and combined for sending.

- Why does the server leave the rest for the next read after reading 1024 bytes? The technology stack is the MINA framework. The problem occurs on Windows Server 2003 but cannot be reproduced on Windows 10.
