0. Background Summary
During a flight, one of our customers found that the Dewu App could not be accessed over the aircraft's WiFi network. This made us realize that in certain scenarios users may be unable to use the Dewu App at all. The SRE team, the wireless team, and the network team investigated together and resolved the problem. Along the way they also found that firewall devices of this kind across the entire network were blocking access to the Dewu App in various consumer (C-end) work and life scenarios. Fixing this helps ensure stable access for Dewu users and also provides a template for troubleshooting similar hard problems.
1. Knowledge Express
1.1 What is In-flight WiFi Technology?
Currently, there are two main solutions for in-flight WiFi services: air-to-ground (ATG) broadband wireless communication systems and airborne satellite communication (SATCOM) systems.
The technical advantages and disadvantages of the two are compared as follows:
Starlink uses a constellation of low-Earth-orbit satellites about 550 kilometers above the Earth's surface, with latency generally within 20 ms. The satellites currently used for cabin communications in China are mostly geostationary satellites about 36,000 kilometers from the Earth, with latency generally above 500 ms.
1.2 Why is TCP widely used in e-commerce business?
Mainstream Internet communication today uses TCP or UDP, both of which fit into the four-layer TCP/IP network model. Compared with UDP, TCP provides reliable, connection-oriented communication:
Under the TCP protocol, before data transmission, the communicating parties need to establish a connection first. When establishing the connection, a series of handshake processes will be performed to ensure that the status and capabilities of the communicating parties are normal before data transmission.
During data transmission, TCP splits the data into segments and uses checksums and acknowledgments for each one, retransmitting when necessary, to ensure that the data arrives correctly.
TCP provides mechanisms such as congestion control and flow control, which adaptively adjust the transmission rate to prevent network congestion and data loss.
Because TCP guarantees the reliability and integrity of data in these ways, it is widely used in e-commerce applications.
2. Air-Ground Coordinated Investigation
2.1 Plan formulation
With an understanding of how in-flight WiFi is implemented, the next question for our SRE experts was how to locate and troubleshoot the problem. For hard problems there are always three moves: reproduce the problem, capture packets, and analyze the complete request path. The difficulty this time was the special scenario: the problem had to be reproduced in an in-flight WiFi environment. Packet capture is also skilled work, so our technical team had to do it in person. Special praise goes to the client-side colleagues on the wireless platform team: to fully reproduce the scenario, they took a designated flight at 7 o'clock in the morning, tested on both legs of the round trip, and collected important packet capture data.
2.2 Test plan & tool confirmation
Because the reproduction conditions are demanding (WiFi can only be turned on at an altitude of 10,000 meters), a complete test plan had to be drawn up to collect as much data as possible. The wireless platform team and the SRE team jointly prepared a test toolkit covering network-level tests (including ping and traceroute), App-level request tests, single-domain access tests, and more. Packet capture tools were also prepared to retain all packet data during testing. SREs stayed on duty at the company to capture request packets on the server side for two-way comparison.
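As a rough illustration of what such a test toolkit might look like, the sketch below runs a list of named checks in order and records a timestamped result for each, so the air-side log can later be lined up against server-side captures. It is not the team's actual toolkit: the function names (`run_checks`, `tcp_connect_check`, `to_json_lines`) and the JSON-lines log format are assumptions made for this example.

```python
import json
import socket
import time
from typing import Callable, Dict, List, Tuple


def tcp_connect_check(host: str, port: int, timeout: float = 5.0) -> Callable[[], str]:
    """Return a check that opens a TCP connection (hypothetical helper)."""
    def check() -> str:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return "connected to %s:%d" % conn.getpeername()[:2]
    return check


def run_checks(checks: List[Tuple[str, Callable[[], str]]]) -> List[Dict[str, object]]:
    """Run every check, never abort, and keep one record per check."""
    results = []
    for name, check in checks:
        started = time.time()
        try:
            detail, ok = check(), True
        except Exception as exc:  # keep the failure as data instead of crashing
            detail, ok = repr(exc), False
        results.append({
            "name": name,
            "ok": ok,
            "detail": detail,
            "started_at": started,
            "elapsed_ms": round((time.time() - started) * 1000, 1),
        })
    return results


def to_json_lines(results: List[Dict[str, object]]) -> str:
    """Serialize results one JSON object per line for later comparison."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in results)
```

A run might pass something like `[("tcp app.dewu.com:443", tcp_connect_check("app.dewu.com", 443))]` and write the output of `to_json_lines` to a file next to the packet captures.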
The following are the troubleshooting tools for each protocol layer, compiled by our veteran SREs and worth saving:
Although TCP is connection-oriented and highly reliable, in real network environments factors such as network complexity, topology, and application defects can still cause a variety of network problems. Below we map the troubleshooting tools to the four-layer model:
When troubleshooting, we usually rule out suspects layer by layer from the bottom up, which helps us avoid detours in daily work.
2.3 Problem reproduction & test packet capture
When the client tester connected to the aircraft's WiFi during cruise and opened the Dewu App, the App could not be accessed. The tester then asked the on-duty personnel to start the test:
(1) Open the Dewu App, browse different pages, and take screenshots to determine the scope of impact.
(2) Run network tests, including ping and traceroute.
(3) Access typical interfaces separately in the browser, such as the main interface, the community interface, and image links.
(4) Test other e-commerce platforms and observe whether they can be accessed.
Screenshots, logs, and packet captures were retained for all of the above. The on-duty SRE simultaneously captured the ingress request packets for these interfaces and saved them for comparative analysis.
2.4 Data collation
2.4.1 Link Diagnosis
Network-level test: tools such as ping and traceroute were used to test the domain names app.dewu.com and m.dewu.com, and all results showed the network layer to be normal. Here is a brief introduction to how ping and traceroute work.
(1) Ping
Ping is a network diagnostic tool based on the ICMP protocol and works at layer 3. It sends an ICMP echo request packet to the target host and waits for the echo reply packet.
The program then calculates the packet loss rate and the round-trip time (RTT) of the packets, so it is mainly used to diagnose network connectivity and latency. The original author of the tool is Mike Muuss, who developed it in 1983. macOS, Windows, and Linux later implemented their own versions. Unless otherwise specified, all parameters and descriptions below refer to the Linux version.
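To make the echo request and the loss/RTT statistics concrete, here is a minimal sketch that builds an ICMP echo request in memory and summarizes a set of measured round-trip times. It does not send anything (raw ICMP sockets usually need elevated privileges), and it is not the code of any real ping implementation; the function names are chosen for this example only.

```python
import struct
from typing import Dict, List, Optional


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by ICMP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """ICMP type 8 (echo request), code 0, with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


def summarize(sent: int, rtts_ms: List[float]) -> Dict[str, Optional[float]]:
    """Packet loss and RTT statistics from the replies that came back."""
    received = len(rtts_ms)
    return {
        "loss_pct": 100.0 * (sent - received) / sent if sent else 0.0,
        "rtt_min": min(rtts_ms) if rtts_ms else None,
        "rtt_avg": sum(rtts_ms) / received if rtts_ms else None,
        "rtt_max": max(rtts_ms) if rtts_ms else None,
    }
```

For example, `summarize(4, [500.0, 520.0, 540.0])` reports 25% loss with an average RTT of 520 ms, which is the kind of figure one would expect over a geostationary satellite link.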
The fields marked in red in the original figure are the more critical fields in the IP and ICMP protocol headers:
As the principle described above shows, ping can only determine connectivity and latency if the target device replies with an echo reply. If the target device is configured to ignore echo requests (for example, "net.ipv4.icmp_echo_ignore_all=1" on Linux) or a firewall drops ICMP packets, the test result is essentially meaningless, and other tools such as telnet, nc, or curl are needed instead.
One particularly interesting point: in versions s20190709 and earlier, the Identifier value is the PID of the current ping process, as shown in the following figure:
The PID of the current ping process is 2570, which is 0x0a0a in hexadecimal, so the 25th and 26th bytes of the packet are displayed as 0x0a0a. This was later considered unsafe, so later versions use a random value instead.
(2) traceroute
Traceroute is used to find the network path that packets take from source to destination and to identify bottlenecks and failures along that path.
It sends an IP packet with a TTL of 1 toward the destination host. The first router that handles the packet decrements the TTL to 0, discards the packet, and returns an ICMP Time Exceeded message, which reveals the address of the first router on the path. Traceroute then sends a packet with a TTL of 2 to learn the address of the second router, and so on until a packet reaches the destination host. The probe packets can carry ICMP, UDP, or TCP as the upper-layer protocol.
It was first implemented by Van Jacobson in 1987. macOS, Windows, Linux, and BSD later implemented their own versions. Mainstream Linux distributions mostly use the project at https://traceroute.sourceforge.net/
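The TTL mechanism described above can be sketched as a small simulation. This is a toy model rather than a real traceroute: no packets are sent, the path is just a list of addresses supplied by the caller, and the names `forward` and `traceroute` are chosen for this example.

```python
from typing import List, Tuple


def forward(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe. `path` lists the routers in order, destination last.

    Each router decrements the TTL; the router that brings it to 0 drops the
    probe and answers with an ICMP Time Exceeded message.
    """
    for router in path[:-1]:
        ttl -= 1
        if ttl == 0:
            return router, "time-exceeded"
    return path[-1], "reached"


def traceroute(path: List[str], max_hops: int = 30) -> List[Tuple[int, str]]:
    """Send probes with TTL 1, 2, 3, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, message = forward(path, ttl)
        hops.append((ttl, responder))
        if message == "reached":
            break
    return hops
```

Calling `traceroute(["10.0.0.1", "10.0.1.1", "203.0.113.9"])` returns one entry per hop, ending with the destination at TTL 3. The addresses are documentation examples, not hops from the incident.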
The traceroute implementation logic and the whole probing process can be verified by capturing packets (tcpdump host 1.1.1.1 -Nn -w <file-name>.pcap). Judging from how the two tools work and from the test data, the network layer is normal.
2.4.2 Application layer testing
On the air side, we tested our company's backend service interfaces across iOS and Android terminals and over both HTTPS and HTTP, and also tried shopping in other companies' apps. Only the Dewu App's interfaces returned errors (over both HTTPS and HTTP), and the browser test returned a page with an interception notice.
2.4.3 Network Packet Capture
The air side captured packets on the iOS device, and the ground side captured packets at the ingress of the anti-DDoS (high-defense) system. Each side saw the other as having sent the forced-disconnect (RST) packet: from the phone's perspective, the high-defense system (server) disconnected first; from the high-defense system's perspective, the phone (client) disconnected first.
iOS:
High-defense end:
Tcpdump is an open-source command-line packet capture tool and has long been one of our SREs' most important tools, so we share it here. With tcpdump you can capture and analyze network packet traffic in order to diagnose network problems, monitor network behavior, and perform network security audits. It is also a very good tool for learning network protocols and packet structure, since it can analyze and decode the packets it captures.
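When both ends of a connection see a reset coming from the other side, it helps to look at the raw header fields of the RST packets themselves. The sketch below parses an IPv4 + TCP header from raw bytes and reports the TCP flags together with the IP TTL and Identification fields. Comparing those fields between a suspicious RST and ordinary packets from the same peer is a common general technique for spotting resets injected by a middlebox; it is offered here as an assumption, not as the exact analysis the team performed. `build_sample_packet` is a hypothetical helper that fabricates test packets with checksums left at zero.

```python
import struct
from typing import Dict

# Low-order TCP flag bits, in header bit order.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10}


def parse_ipv4_tcp(packet: bytes) -> Dict[str, object]:
    """Extract the fields most useful for judging who really sent a reset."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    if packet[9] != 6:
        raise ValueError("not a TCP segment")
    ip_header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    (ip_id,) = struct.unpack("!H", packet[4:6])
    tcp = packet[ip_header_len:ip_header_len + 14]
    if len(tcp) < 14:
        raise ValueError("truncated TCP header")
    src_port, dst_port, seq, ack, offset_flags = struct.unpack("!HHIIH", tcp)
    flags = [name for name, bit in TCP_FLAGS.items() if offset_flags & bit]
    return {
        "ttl": packet[8],
        "ip_id": ip_id,
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "flags": flags,
        "is_reset": "RST" in flags,
    }


def build_sample_packet(ttl: int, ip_id: int, flags: int) -> bytes:
    """Fabricate a 40-byte IPv4+TCP packet for testing (checksums are zero)."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, ip_id, 0, ttl, 6, 0,
                     bytes([192, 0, 2, 10]), bytes([198, 51, 100, 20]))
    tcp = struct.pack("!HHIIHHHH", 443, 50000, 1000, 2000,
                      (5 << 12) | flags, 65535, 0, 0)
    return ip + tcp
```

If the RST carries a TTL or IP Identification pattern that differs clearly from the peer's other packets, that points toward a device in the middle rather than the peer itself.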
2.5 Packet Analysis
Conclusion: the most likely cause is interception by an intermediate device such as a firewall.
2.6 Simulation Reproduction
Judging from the following screenshots, our company's firewall appeared to be from the same manufacturer, so I quickly organized a simulation with my network colleagues:
Enable the "Disable access to websites/software downloads" policy on the firewall for a specific terminal IP, then request https://app.dewu.com in the browser and observe that the policy is hit.
Packets were captured at the client and at the firewall egress at the same time:
Computer client side:
Firewall egress side:
Based on this chain of evidence, it can basically be confirmed that the firewall policy misclassified the domain name of the company's Dewu App as a download website.
2.7 Manufacturer Communication
After we reported the reproduced problem to the manufacturer, the manufacturer's policy engineer confirmed that there was a bug in the "access website/software download" policy. During the communication it was also confirmed that the airline and our company were using firewalls from the same manufacturer.
2.8 Progress Synchronization
April 18: the manufacturer released the policy update across its whole network.
April 19: our AC equipment automatically updated its policy.
April 21: a friend helped verify on the same flight that the Dewu App worked normally, and the verification passed.
3. Review of network technology points
3.1 traceroute
From this traceroute data we can determine one thing:
3.2 IP Header
3.3 AC Equipment Network Management
An AC (Access Controller) is a centrally controlled network device used to manage the Internet access behavior of multiple APs (Access Points). Its Internet access behavior management function helps administrators monitor and manage users' Internet access, including the following aspects:
User authentication: manages user logins by user and user group, and can be configured for local authentication, AAA authentication, and so on.
URL filtering: uses HTTP identification to read the Host field of the HTTP request, learns which website the user wants to visit, and filters websites accordingly (the problem in this incident was in this function).
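The URL filtering idea can be illustrated with a minimal sketch: read the Host header from a plain HTTP request, look the host up in a category table, and block it if the category is on the deny list. The category names, the table contents, and the function names are all hypothetical and do not describe the vendor's actual engine; the point is only to show how one wrong category entry is enough to block a legitimate site.

```python
from typing import Dict, Optional, Set


def extract_host(raw_request: bytes) -> Optional[str]:
    """Return the lower-cased Host header of a plain HTTP request, if any."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            host = value.strip().decode("ascii", "replace").lower()
            # Drop an optional ":port" suffix (IPv6 literals are not handled).
            if not host.startswith("["):
                host = host.split(":", 1)[0]
            return host
    return None


def filter_decision(host: Optional[str],
                    categories: Dict[str, str],
                    blocked_categories: Set[str]) -> str:
    """Block the request when the host's category is on the deny list."""
    category = categories.get(host or "", "uncategorized")
    return "block" if category in blocked_categories else "allow"
```

With a table that wrongly maps `app.dewu.com` to a hypothetical `software-download` category and a policy that blocks that category, the request is blocked; correcting the table entry to something like `shopping` lets it through, which mirrors the effect of the manufacturer's policy update.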