1. Background

Recently I ran into a particularly strange problem in the pre-release environment. During production, each device goes through a registration process that involves TCP communication with the server: obtaining configuration files, sending keys, and other operations. For some devices, production progress got stuck at 70%.

The process is shown in the figure below. You don't need to study it in detail; just look at the D4 and D5 stages. Data communication method: TCP.

[picture]

The configuration file is stored in key=value format. When the name field in the configuration file is rabbit, the device is produced normally. When the name field is rabbit-TD, production fails and the progress stays stuck at 70%.

From the symptom alone, it was not clear whether the device never executed the D5 stage or the server failed to process it.

2. Troubleshooting process

2.1. Check the code

Check the device and server code to see whether either side restricts the length of the name field.

Conclusion: Neither the device nor the server limits the length of fields in the configuration file.

2.2. View the server log

The server logs contained only the business logs of the D4 stage, and none for the D5 stage.

Preliminary conclusion: The device does not send data packets in the D5 stage.

2.3. Server-side packet capture

Idea: Capture packets to see whether the server received the D5 data packet.

On the server side, capture packets with Microsoft Network Monitor, then open the capture file in Wireshark for analysis. The figure below shows the TCP traffic between the device and the server.

[picture]

You can see that the device sent the configuration file to the server (D4 stage), and the server replied with an ACK.
In TCP (Transmission Control Protocol) communication, when a client sends a segment to a server, the server normally replies with an ACK (acknowledgement) to indicate that it has received the data. This is part of TCP's reliable transmission mechanism, which ensures that data travels correctly from sender to receiver. TCP uses sequence numbers and acknowledgement numbers to achieve this: the sender assigns a sequence number to every byte it sends, and the receiver replies with an ACK whose acknowledgement number is the sequence number of the next byte it expects. If the sender does not receive the ACK within a certain period of time, it resends the data. (From AI)

Preliminary conclusion: The server sent an ACK in the D4 stage. The device did not send a data packet in the D5 stage.

Note: This conclusion was overturned later in the investigation.

2.4. Capture packets on the device

Idea: Capture packets to see whether the device sent a data packet in the D5 stage.

Capture packets on the device with a command-line capture tool. The captured packets are as follows:

[picture]

From the capture above, we can see that the last packet carries both D4 and D5: the two stages are merged and sent together. (I only realized this later, and it is the root of the 1024-byte bug.) In other words, D4 and D5 are effectively one stage on the wire and are not sent separately. The device then waits for the server to return the configuration file (P6 stage).

Preliminary conclusion: The device executed the D5 stage, but the server did not execute the P6 stage. The problem is on the server.

2.5. Check the data packet on the server again

This is embarrassing. The device clearly executed the D5 stage, but the server does not seem to have received the D5 data packet.
Looking at the last data packet again, the content of the message is shown in the following figure:

[picture]

Open the D4 data packet, and you can see that it contains both the configuration file content of the D4 stage and the file content of the D5 stage.

I was confused when I saw this packet. The interface documentation says that D4 and D5 send their data separately, so why are they sent together here?

Reason: The device writes the D4 and D5 messages to the socket back to back, so they travel as one continuous stream of bytes.

Preliminary conclusion: The server did not correctly process the combined D4 and D5 data.

What can we do? The only option is to add more logging on the server to see why the D5 data is not processed correctly.

2.6. Analyze the data packet

2.6.1. Message when name=rabbit (can be produced normally)

Each stage sends a message in the following format: 0x1234abcd, length, type, data.
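The frame layout above can be sketched in a few lines of Python. This is only an illustration, not the device's actual code: the field widths (4-byte start marker, 4-byte length, 1-byte type) come from the byte counts given later in the article, while the big-endian byte order, the type codes, and the payloads are assumptions made for the example.

```python
import struct

MAGIC = 0x1234ABCD  # start marker quoted in the article


def build_frame(msg_type: int, data: bytes) -> bytes:
    """Start marker (4 bytes) + data length (4 bytes) + type (1 byte) + data."""
    # Assumption: big-endian fields, and the length counts only the data bytes.
    return struct.pack(">IIB", MAGIC, len(data), msg_type) + data


# Hypothetical payloads and type codes, for illustration only.
d4 = build_frame(0x01, b"name=rabbit\n")
d5 = build_frame(0x02, b"\x00" * 256)

# Two back-to-back socket writes reach the peer as one continuous byte stream;
# TCP itself keeps no boundary between the D4 frame and the D5 frame.
stream = d4 + d5
```

Because TCP delivers a byte stream rather than discrete messages, the receiver has to find the boundary between D4 and D5 itself, using the length field.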
When the name field in the configuration file is rabbit, the combined content of messages D4 and D5 is as follows:

[picture]

Explanation:

[picture]
2.6.2. Message when name=rabbit-TD (cannot be produced normally)

When the name field in the configuration file is rabbit-TD, the combined content of messages D4 and D5 is as follows:

[picture]

Explanation:

The contents of the log are as follows:

[picture]

[picture]

The log content is as follows:

[picture]
2.7. The truth comes out

When the server has read 1024 bytes of the merged message, the read boundary cuts through the four-byte length field of the D5 business data. The first 1024 bytes contain only the first byte of that length field. On the next read, the server treats the remaining 3 bytes of the length field plus the 1-byte request type as the 4-byte length field, so the last byte is misread. The computed length is 65538, which does not match the 256 bytes of business data that follow. The server program reports an error, and the subsequent code is never executed.

3. Solution

3.1. Solution 1

The root cause is that the 1 length byte from the previous read is not combined with the 3 length bytes from the next read to form the value of the length field. The fix is to make sure that the earlier byte is still available when the length field is read the second time.

3.2. Solution 2

There is another way to avoid the bug: add a little content to the D4 configuration file so that its content is exactly 1014 + 1 = 1015 bytes, or at least 1014 + 5 = 1019 bytes. The purpose is to make the complete four-byte length field fall after the 1024-byte boundary, or to make the four bytes of starting data fall after it as well.

Two cases were verified: Rabbit-TDDDDDDD and Rabbit-TDD were both produced normally.

The following is the Rabbit-TDD case, where the D4 data plus the D5 starting data fill exactly 1024 bytes, as shown in the following figure:
[picture]

To restate why this workaround lets the system run normally:

1024 bytes = 1015 (configuration file message content) + 4 (configuration file message length) + 1 (request type) + 4 (D5 message start data)

or

1024 bytes = 1019 (configuration file message content) + 4 (configuration file message length) + 1 (request type)

Two questions remain open:

- Why is the starting data of the D4 stage not counted in the 1024 bytes? I also don't understand how the socket data is split and combined for sending.
- Why does the server move on to the next read after reading 1024 bytes? The technology stack is the MINA framework. The problem occurs on Windows Server 2003, but it cannot be reproduced on Windows 10.
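Solution 1 amounts to keeping leftover bytes between reads and decoding a frame only once all of it has arrived. The sketch below shows the idea in Python; it is not the server's code. The class name `FrameDecoder`, the big-endian byte order, and the D4 type code are assumptions. The D5 type code 0x02 is inferred from the article's numbers: with a 256-byte payload, a big-endian length is 00 00 01 00, and reading its last three bytes plus a type byte of 02 gives 0x00010002 = 65538. MINA itself ships a `CumulativeProtocolDecoder` for this kind of accumulation, though the article does not say whether the server uses it.

```python
import struct

MAGIC = b"\x12\x34\xab\xcd"   # start marker from the article
HEADER_LEN = 9                # 4 (start marker) + 4 (length) + 1 (type)


class FrameDecoder:
    """Hypothetical stream-safe decoder: buffers bytes across reads."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk: bytes):
        """Append one read and return every complete (type, data) frame."""
        self._buf += chunk
        frames = []
        while len(self._buf) >= HEADER_LEN:
            if bytes(self._buf[:4]) != MAGIC:
                raise ValueError("bad start marker")
            # Assumption: big-endian length, followed by a 1-byte type.
            length, msg_type = struct.unpack(">IB", self._buf[4:9])
            if len(self._buf) < HEADER_LEN + length:
                break  # frame incomplete: keep the bytes for the next read
            frames.append((msg_type, bytes(self._buf[9:9 + length])))
            del self._buf[:9 + length]
        return frames


def build_frame(msg_type: int, data: bytes) -> bytes:
    return MAGIC + struct.pack(">IB", len(data), msg_type) + data


d4 = build_frame(0x01, b"x" * 1014)   # 1014-byte config, as with rabbit-TD
d5 = build_frame(0x02, b"y" * 256)    # 256 bytes of business data
stream = d4 + d5

# Cut the stream right after the first byte of D5's length field,
# which is where the article says the read boundary fell.
cut = len(d4) + 4 + 1
first, second = stream[:cut], stream[cut:]

# The bug: the next read starts with 3 length bytes plus the type byte.
naive_length = struct.unpack(">I", second[:4])[0]

# The fix: feed both reads through one decoder that keeps the leftover byte.
decoder = FrameDecoder()
frames = decoder.feed(first) + decoder.feed(second)
```

Here `naive_length` comes out as 65538, the value seen in the server log, while the buffering decoder recovers both frames with the D5 payload at its correct 256 bytes. Unlike Solution 2, this does not depend on where the read boundary happens to fall.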