Dewu App blocked by in-flight WiFi at 10,000 meters

0. Background summary

During a flight, one of our customers found that the Dewu App could not access the network normally over the plane's WiFi. This made us realize that in certain scenarios users may be unable to use the Dewu App at all. The SRE team, the wireless team, and the network team investigated together and solved the problem. In the process they also found that the root cause affected the same vendor's firewall devices across the network, which could block access to the Dewu App in various consumer (C-end) work and life scenarios. The fix helps ensure stable access for Dewu users, and the investigation provides a template for troubleshooting similar hard problems.

1. Quick primer

1.1 What is In-flight WiFi Technology?

Currently, there are two main solutions for in-flight WiFi: air-to-ground (ATG) broadband wireless communication and airborne satellite communication (SATCOM).

  • The ATG system uses customized wireless transceivers: ground base stations with skyward-facing antennas are set up along the flight route or in specific airspace to form a ground-to-air communication link.
  • The airborne satellite communication system relays data between the aircraft and satellite ground stations via satellites.

The technical advantages and disadvantages of the two are compared as follows:

| Metric | ATG | SATCOM |
| --- | --- | --- |
| Latency | <100 ms | Higher, 10-700 ms, depending on satellite type and orbit altitude |
| Coverage | The area covered by ground base stations, mainly over land, with a maximum radius of 300 km | Global, including areas far from land and over oceans |
| Network connectivity | There may be signal blind spots between ground base stations | Higher connectivity via satellite signal transmission |
| Reliability | May be affected by terrain and base station distribution | Affected by satellite signal strength and the number of available satellites |
| Applicable scenarios | Mainly suitable for flights over land | Applicable to flights worldwide, including transoceanic routes |

Musk's Starlink service uses a low-Earth-orbit constellation about 550 kilometers above the Earth's surface, with latency generally within 20 ms. The satellites currently used for cabin communications in China are mostly geostationary satellites about 36,000 kilometers above the Earth, with latency generally above 500 ms.
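The gap between the two latency figures follows largely from distance. As a rough sketch, the lower bound on round-trip time can be estimated from orbit altitude alone, assuming (for simplicity) that the satellite sits directly above both the aircraft and the ground station and that only propagation delay counts; real latency adds processing, queuing, and longer slant paths.

```python
# Rough lower bound on satellite round-trip time from orbit altitude alone.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum


def min_rtt_ms(altitude_km: float) -> float:
    """Minimum RTT in ms for aircraft -> satellite -> ground station and back.

    Simplifying assumption: the satellite is directly overhead for both ends,
    so the signal crosses the orbit altitude four times per round trip.
    """
    path_km = 4 * altitude_km
    return path_km / SPEED_OF_LIGHT_KM_S * 1000


leo_rtt = min_rtt_ms(550)      # low-orbit altitude quoted above
geo_rtt = min_rtt_ms(36_000)   # geostationary altitude quoted above
print(f"LEO: {leo_rtt:.1f} ms, GEO: {geo_rtt:.1f} ms")
```

Under these assumptions the geostationary path alone accounts for most of the 500 ms quoted above, while the low-orbit path contributes only a few milliseconds.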

1.2 Why is TCP widely used in e-commerce business?

Today's mainstream Internet transport protocols, TCP and UDP, both follow the four-layer TCP/IP network model. Compared with UDP, TCP provides reliable, connection-oriented communication:

  • Three-way handshake

Before any data is transmitted, the two parties must first establish a connection. The handshake performed during connection setup confirms that both sides are reachable and lets them agree on their state and capabilities.

  • Packet acknowledgement mechanism (ACK)

During transmission, TCP splits the data into multiple segments, and the receiver acknowledges what it has received so that lost data can be retransmitted and the data arrives correctly.

  • Congestion and flow control

TCP provides congestion control and flow control, which adaptively adjust the sending rate to avoid network congestion and data loss.

Because these mechanisms guarantee the reliability and integrity of data, TCP is widely used in e-commerce applications.
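The connection-oriented behaviour described above can be observed with a minimal loopback sketch. It is illustrative only: the payload is a made-up value, and the handshake, acknowledgements, and retransmissions all happen inside the operating system rather than in the script.

```python
import socket

# Server side: listen on an ephemeral loopback port (port 0 lets the OS choose).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

# Client side: create_connection() returns only after the three-way handshake
# has completed; no application data has been sent yet.
client = socket.create_connection(("127.0.0.1", port), timeout=5)

# The kernel already finished the handshake, so accept() returns immediately.
conn, peer = server.accept()
conn.settimeout(5)

payload = b"order-123"   # hypothetical application data
client.sendall(payload)  # TCP segments, acknowledges and retransmits as needed

# TCP is a byte stream, so read until the whole payload has arrived.
received = b""
while len(received) < len(payload):
    chunk = conn.recv(1024)
    if not chunk:
        break
    received += chunk

print(received)

for s in (client, conn, server):
    s.close()
```

With UDP the equivalent script would have no connection step and no delivery guarantee, which is the trade-off the list above describes.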

2. Coordinated air-ground investigation

2.1 Plan formulation

With an understanding of how in-flight WiFi works, the next question for our SRE experts was how to locate and troubleshoot the problem. For hard problems there are always three standard moves: reproduce the problem, capture packets, and analyze the complete request path. The difficulty this time was the special scenario: the problem could only be reproduced in an in-flight WiFi environment, and packet capture requires technical skill, so our technical team had to do it in person. Special thanks go to our client-side colleagues on the wireless platform team: to reproduce the scenario faithfully, they took a designated 7 a.m. flight, tested on both the outbound and return legs, and collected the key packet capture data.

2.2 Test plan & tool confirmation

Because the reproduction conditions are demanding (the WiFi can only be turned on at cruising altitude, around 10,000 meters), a complete test plan was needed to collect as much data as possible. The wireless platform team and the SRE team jointly prepared a test toolkit covering network-level tests (ping, traceroute), app-level request tests, single-domain access tests, and so on. Packet capture tools were also prepared so that all traffic during testing was retained. Meanwhile, SREs were on duty at the company to capture the requests arriving at the server side for two-way comparison.

The following troubleshooting tools for each protocol layer were compiled by our veteran SREs and are worth saving:

Although TCP is connection-oriented and highly reliable, in real networks various problems can still occur because of network complexity, topology, application defects, and other factors. We have classified the troubleshooting tools according to the four-layer model:

When troubleshooting, we usually rule out suspicious points layer by layer from the bottom up, which helps us avoid detours in daily work.

2.3 Problem reproduction & test packet capture

While the plane was cruising, the client-side tester connected to the in-flight WiFi, opened the Dewu App, and found that it could not be accessed. The tester then asked the on-duty personnel to start the test.

(1) Open the Dewu App, browse different pages, and take screenshots to determine the scope of impact.

(2) Run network tests (ping, traceroute, etc.).

(3) Access typical interfaces separately in the browser, such as the main interface, the community interface, and image links.

(4) Test other e-commerce platforms and observe whether they can be accessed.

Screenshots, logs, and packet capture data were kept for all of the above.

At the same time, the on-duty SRE captured and saved the request packets arriving at the ingress of these interfaces for later comparative analysis.

2.4 Data collation

2.4.1 Link Diagnosis

Network-layer test: ping, traceroute, and similar tools were run against the domain names app.dewu.com and m.dewu.com, and all results showed that the network layer was normal.

Here is a brief introduction to how ping and traceroute work.

(1) Ping tool

Ping is a network diagnostic tool built on the ICMP protocol and works at layer 3. It sends an ICMP Echo Request packet to the target host and waits for the Echo Reply, then computes the packet loss rate and the round-trip time (RTT). It is therefore mainly used to diagnose network connectivity and latency.

The tool was originally written by Mike Muuss in 1983. macOS, Windows, and Linux later implemented their own versions. Unless otherwise specified, all parameters and descriptions below refer to the Linux version.

  • The ping tool is part of the iputils package, an open source project: https://github.com/iputils/iputils
  • Format of a ping packet based on the ICMP protocol

The fields marked in red in the figure above are the more critical fields in the IP and ICMP headers:

| Protocol | Field | Value | Meaning and function |
| --- | --- | --- | --- |
| IP | Identification | 0-65535 | Identifier of the datagram, also used for IP fragmentation. When the payload of an IP packet exceeds 1480 bytes (a 1500-byte MTU minus the 20-byte header), the packet is split into multiple fragments that all carry the same Identification value. |
| IP | Flags | 3 bits | Control fragmentation. Bit 0 is reserved and must be 0. Bit 1 is DF (Don't Fragment): 1 means the packet must not be fragmented, 0 means fragmentation is allowed. Bit 2 is MF (More Fragments): 1 means more fragments follow, 0 means this is the last fragment or the packet is not fragmented. The position of each fragment is carried in the separate Fragment Offset field. |
| IP | TTL | 1-255 | Mainly used to stop routing loops, so that IP packets are not forwarded endlessly. The value decreases by 1 each time the packet passes through a router. Tip: the default initial value on Linux is generally 64, so after capturing a packet on the server side, 64 minus the observed TTL tells you how many routers the packet has passed through (assuming the sender started at 64). |
| IP | Protocol | 1/2/6/17 | Upper-layer protocol carried in the payload: 1 = ICMP, 2 = IGMP, 6 = TCP, 17 = UDP |
| ICMP | Type | 0-18 | Selected types: Echo Reply (type 0), returned by the target in response to a ping; Destination Unreachable (type 3), indicating that the destination network, host, or port is unreachable; Echo Request (type 8), sent by ping to test connectivity. |
| ICMP | Identifier | Random or specified value | Present in both Echo Request and Echo Reply messages to distinguish different ICMP sessions. The sender generates a 16-bit identifier for its Echo Requests and, when a reply arrives, compares the identifier in the reply to confirm that it answers one of its own requests. |
| ICMP | Sequence Number | 0-65535 | Starts at 0 or 1, depending on the implementation, and increases by one for each Echo Request sent. The target copies the value into the Echo Reply so that the requester can match each reply to the corresponding request. |
| ICMP | Timestamp | Timestamp | Mainly used to measure RTT. Ping writes the send time into the first bytes of the Echo Request data; the target copies the data back unchanged in the Echo Reply, and the sender computes the RTT as the receive time minus this timestamp. |
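To make the ICMP fields above concrete, here is a minimal sketch that builds an Echo Request, simulates the target's Echo Reply, and computes the RTT from the echoed timestamp. It only constructs bytes in memory and sends nothing. The clock readings and identifier are hypothetical, and the timestamp is packed as an 8-byte float for simplicity, whereas the real ping stores a C time structure.

```python
import struct

ICMP_ECHO_REQUEST = 8
ICMP_ECHO_REPLY = 0


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo(icmp_type: int, ident: int, seq: int, payload: bytes) -> bytes:
    """Header layout: type, code, checksum, identifier, sequence number."""
    header = struct.pack("!BBHHH", icmp_type, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", icmp_type, 0, checksum, ident, seq) + payload


def parse_echo(packet: bytes) -> dict:
    icmp_type, code, checksum, ident, seq = struct.unpack("!BBHHH", packet[:8])
    return {
        "type": icmp_type,
        "ident": ident,
        "seq": seq,
        "payload": packet[8:],
        # A packet with a correct checksum sums to zero when re-checked.
        "valid": internet_checksum(packet) == 0,
    }


# Sender: the first 8 payload bytes carry the send time, the rest is padding.
send_time = 1000.0  # hypothetical clock reading, in seconds
payload = struct.pack("!d", send_time).ljust(56, b"\x00")  # 56 bytes, the -s default
request = build_echo(ICMP_ECHO_REQUEST, ident=0x0A0A, seq=1, payload=payload)

# Target: copies identifier, sequence number and payload into the reply.
req = parse_echo(request)
reply = build_echo(ICMP_ECHO_REPLY, req["ident"], req["seq"], req["payload"])

# Sender again: match the reply, then compute RTT from the echoed timestamp.
rep = parse_echo(reply)
receive_time = 1000.65  # hypothetical clock reading, in seconds
echoed_time = struct.unpack("!d", rep["payload"][:8])[0]
rtt_ms = (receive_time - echoed_time) * 1000
print(rep["type"], hex(rep["ident"]), rep["seq"], round(rtt_ms))
```

Because the target echoes the payload unchanged, the sender needs no per-packet state to measure RTT: the send time travels inside the packet.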

  • Default values of some ping parameters

| Parameter | Linux default | Description |
| --- | --- | --- |
| -t | 64 | TTL value |
| -c | Unlimited (keeps sending until interrupted) | Number of ICMP packets to send |
| -s | 56 bytes | Size of the ICMP data payload |
| -W | 10 seconds | Timeout for waiting for each reply, in seconds |
| -i | 1 second | Interval between ICMP packets |

  • Exceptions to ping

As the principle described at the beginning shows, ping can only determine connectivity and latency if the target device replies with an Echo Reply. If the target is configured to ignore echo requests (net.ipv4.icmp_echo_ignore_all=1) or a firewall drops ICMP packets, the ping result is essentially meaningless. In that case, other tools such as telnet, nc, or curl are needed for testing.
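When ICMP is filtered, a TCP-level probe gives the same kind of answer that telnet or nc would. The following is a minimal sketch of such a probe; the function name is ours, and the demonstration runs against a throwaway loopback listener so that it works offline.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure, ...
        return False


# Offline demonstration against a temporary listener on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

open_result = tcp_port_open("127.0.0.1", demo_port)
listener.close()
closed_result = tcp_port_open("127.0.0.1", demo_port, timeout=1.0)
print(open_result, closed_result)
```

Against a real service the call would look like tcp_port_open("app.dewu.com", 443). A successful result only proves that the three-way handshake completes; as this case shows, an intermediate device can still interfere after the handshake.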

One particularly interesting point: in iputils version s20190709 and earlier, the Identifier value is the pid of the ping process, as shown in the following figure:

The pid of the current ping process is 2570, which is 0xa0a in hexadecimal, so the 25th and 26th bytes of the packet show 0x0a0a. This was later considered insecure, so subsequent versions use a random value instead.

 // ping_common.c

 // Version s20190709 and earlier
 if (sock->socktype == SOCK_RAW)
     ident = htons(getpid() & 0xFFFF);

 // Later versions
 if (sock->socktype == SOCK_RAW && rts->ident == -1)
     rts->ident = rand() & IDENTIFIER_MAX;
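The pid-to-identifier mapping from the older code can be checked with a few lines of arithmetic. This sketch mirrors the `getpid() & 0xFFFF` expression and packs the result in network byte order, which is what `htons` produces on the wire; the second pid is a made-up value showing truncation to 16 bits.

```python
import struct

pid = 2570                      # pid from the capture described above
ident = pid & 0xFFFF            # keep only the low 16 bits, as in the C code
wire_bytes = struct.pack("!H", ident)  # network (big-endian) byte order
print(hex(ident), wire_bytes.hex())

# A pid wider than 16 bits is truncated (hypothetical pid for illustration).
large_pid = 70000
truncated = large_pid & 0xFFFF
print(truncated)
```

Since the identifier was derived from the pid, anyone who could see ICMP traffic learned the low 16 bits of a process id on the sending host, which helps explain the later switch to a random value.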

(2) traceroute

  • Function and effect

Traceroute finds the network path that packets take from source to destination and helps identify bottlenecks and failures along that path.

  • How it works

It first sends an IP packet with a TTL of 1 toward the destination host. The first router that handles the packet decrements the TTL to 0, discards the packet, and returns an ICMP Time Exceeded message, which reveals the address of the first router on the path. Traceroute then sends a packet with a TTL of 2 to learn the address of the second router, and so on until a packet reaches the destination host.

The upper-layer protocol carried by the probe packet can be ICMP, UDP, or TCP.
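The TTL-stepping procedure can be illustrated without raw sockets by simulating a path. In this sketch the router list is hypothetical (documentation and private addresses), and `forward` stands in for the network: it decrements the TTL at each router and reports who answered.

```python
from typing import List, Tuple


def forward(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe with the given initial TTL.

    `path` lists the routers in order, followed by the destination host.
    Returns ("time-exceeded", router) or ("reached", destination).
    """
    for router in path[:-1]:
        ttl -= 1                      # every router decrements the TTL
        if ttl == 0:
            return ("time-exceeded", router)  # router drops it and reports back
    return ("reached", path[-1])


def traceroute(path: List[str], max_hops: int = 30) -> List[Tuple[int, str]]:
    """Send probes with TTL 1, 2, 3, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        kind, address = forward(path, ttl)
        hops.append((ttl, address))
        if kind == "reached":
            break
    return hops


# Hypothetical path: three routers, then the destination.
simulated_path = ["10.0.0.1", "172.16.0.1", "203.0.113.1", "198.51.100.10"]
result = traceroute(simulated_path)
for ttl, address in result:
    print(ttl, address)
```

A real implementation also sends several probes per hop and records the RTT of each, which is how per-hop latency appears in the output.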

  • Tool Development History

It was first implemented in 1987 by Van Jacobson. macOS, Windows, Linux, and BSD later implemented their own versions. Mainstream Linux distributions mostly use the project at https://traceroute.sourceforge.net/

  • Default configuration values

 // Excerpt of the configuration in the code
 #define MAX_HOPS 255            // maximum hop count; limits the farthest node traceroute can trace
 #define MAX_PROBES 10           // maximum number of probes per hop
 #define DEF_HOPS 30             // default maximum hop count
 #define DEF_NUM_PROBES 3        // default number of probes per hop
 #define DEF_WAIT_SECS 5.0       // default time to wait for each response, in seconds
 #define DEF_DATA_LEN 40         // default payload size of the IP packet
 #define MAX_PACKET_LEN 65000    // maximum packet length, 65000 bytes
 #ifndef DEF_AF
 #define DEF_AF AF_INET          // default address family; AF_INET means IPv4
 #endif

 static const char *module = "default";   // UDP probing is used by default

 static tr_module default_ops = {
     .name = "default",
     .init = udp_default_init,
     .send_probe = udp_send_probe,
     .recv_probe = udp_recv_probe,
     .expire_probe = udp_expire_probe,
     .header_len = sizeof (struct udphdr),
 };

 #define DEF_START_PORT 33434    /* starting destination port for UDP probes */

  • Packet format analysis

The code logic and the entire probing process can be verified by capturing packets (tcpdump host 1.1.1.1 -Nn -w FILE_NAME.pcap)

Judging from how the two tools work and from their test data, the network layer is normal.

2.4.2 Application layer testing

On the air side, we tested our company's backend service interfaces across iOS and Android devices and over both HTTPS and HTTP, and also tried the shopping flow in other companies' apps. Only the Dewu App's interfaces returned errors (over both HTTPS and HTTP), and the browser test returned a page with an interception notice.

2.4.3 Network Packet Capture

The air side captured packets on the iOS device, and the ground side captured packets at the ingress of the high-defense (anti-DDoS) system. Each side saw the other side send a forced disconnect (TCP reset): from the phone's point of view, the high-defense system (server) disconnected first, while from the high-defense system's point of view, the phone (client) disconnected first.

iOS:

High-defense end:

Tcpdump is a very useful open source packet capture tool and has always been one of the most important tools for our SREs, so here is a brief introduction:

Tcpdump is a powerful command-line network packet capture tool.

With tcpdump you can capture and analyze packet traffic on the network to diagnose network problems, monitor network behavior, and perform network security audits.

Because it parses and decodes packets, tcpdump is also a very good tool for learning network protocols and packet structure.

  • tcpdump works at the data link layer

  • How it works
  1. Packet capture: tcpdump calls the libpcap library to capture packets on the specified network interface. Specifically, libpcap uses the raw socket interface provided by the operating system to intercept packets and passes them to the tcpdump process through a callback mechanism.
  2. Packet filtering: tcpdump filters the captured packets according to user-defined rules. The rules are implemented with BPF (Berkeley Packet Filter), an instruction-set-based filter that can match on each field of a packet and keep only the packets that meet the conditions.
  3. Packet parsing: once packets pass the filter, tcpdump parses and formats them and displays their fields and attributes. The key to parsing is identifying and decoding the packet format. tcpdump and libpcap can recognize and process many formats, including Ethernet, IPv4/IPv6, TCP/UDP, DNS, and HTTP.
  4. Packet display: finally, tcpdump writes the parsed packet content to standard output or to a user-specified file. The user can process the output further (filtering, sorting, statistics) to understand network traffic, analyze protocol and application behavior, find problems, and optimize performance.
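Files saved with tcpdump -w use the classic pcap layout: a 24-byte global header followed by one 16-byte record header per packet. As a minimal sketch of step 4, the code below writes and reads that layout using only the standard library. The sample frames are placeholder bytes, and newer pcapng files or nanosecond-resolution captures are not handled.

```python
import struct
from typing import List, Tuple

PCAP_MAGIC = 0xA1B2C3D4       # classic pcap with microsecond timestamps
LINKTYPE_ETHERNET = 1

Record = Tuple[int, int, bytes]  # (ts_sec, ts_usec, packet bytes)


def write_pcap(packets: List[Record], snaplen: int = 65535) -> bytes:
    """Serialize packets into the classic pcap format (little-endian)."""
    out = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, LINKTYPE_ETHERNET)
    for ts_sec, ts_usec, data in packets:
        out += struct.pack("<IIII", ts_sec, ts_usec, len(data), len(data)) + data
    return out


def read_pcap(blob: bytes) -> Tuple[int, List[Record]]:
    """Return (link type, records) from a classic pcap byte string."""
    magic = struct.unpack("<I", blob[:4])[0]
    if magic == PCAP_MAGIC:
        endian = "<"
    elif magic == 0xD4C3B2A1:  # same magic written by a big-endian host
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    header = struct.unpack(endian + "IHHiIII", blob[:24])
    linktype = header[6]
    records, offset = [], 24
    while offset + 16 <= len(blob):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", blob[offset:offset + 16]
        )
        offset += 16
        records.append((ts_sec, ts_usec, blob[offset:offset + incl_len]))
        offset += incl_len
    return linktype, records


# Round trip with two placeholder frames (not real traffic).
sample = [(1700000000, 100, b"\x00" * 14 + b"first"), (1700000000, 650100, b"\x00" * 14 + b"second")]
blob = write_pcap(sample)
linktype, records = read_pcap(blob)
print(linktype, len(records), records[1][1] - records[0][1])
```

To read a real capture, pass the file contents to read_pcap, for example read_pcap(open(path, "rb").read()), and then decode each record according to the link type.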

2.5 Packet Analysis

  • The network-layer test results show that the network layer and the port are reachable and that the TCP three-way handshake completes. This means the network path is unobstructed and there is no network failure. (This is also confirmed by the fact that other companies' apps could be accessed normally.)
  • The packet captures show that each side saw the other initiate a forced disconnect (TCP reset): the phone saw the high-defense system (server) disconnect first, while the high-defense system saw the phone (client) disconnect first.
  • The application-layer screenshots suggest that the requests were intercepted by something like an ACL.

Conclusion: the most likely cause is interception by an intermediate device such as a firewall.
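When both endpoints see a reset that the other claims not to have sent, one common heuristic is to compare the TTL of each RST with the TTL of ordinary packets from the same source address, since a forged RST usually travels a different number of hops. The sketch below illustrates that heuristic only; it is not the script used in this investigation, and the addresses, TTL values, and tolerance are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Packet(NamedTuple):
    src: str   # source IP address seen in the capture
    ttl: int   # TTL observed at the capture point
    rst: bool  # True if the TCP RST flag is set


def suspicious_resets(packets: List[Packet], tolerance: int = 2) -> List[Packet]:
    """Flag RST packets whose TTL differs from the sender's usual TTL."""
    ttl_counts: Dict[str, Counter] = {}
    for p in packets:
        if not p.rst:
            ttl_counts.setdefault(p.src, Counter())[p.ttl] += 1
    # Baseline: the most frequent TTL among each source's non-RST packets.
    baseline = {src: c.most_common(1)[0][0] for src, c in ttl_counts.items()}
    return [
        p for p in packets
        if p.rst and p.src in baseline and abs(p.ttl - baseline[p.src]) > tolerance
    ]


# Hypothetical capture taken on the client side.
capture = [
    Packet("203.0.113.10", 52, False),   # normal packets from the "server"
    Packet("203.0.113.10", 52, False),
    Packet("203.0.113.10", 52, False),
    Packet("203.0.113.10", 61, True),    # RST claiming to come from the server
    Packet("192.0.2.20", 64, False),     # another peer with a consistent TTL
    Packet("192.0.2.20", 64, True),
]
flagged = suspicious_resets(capture)
print(flagged)
```

A flagged RST is a hint rather than proof: route changes or load balancers can also shift the TTL, so the result should be read together with the other evidence above.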

2.6 Simulation Reproduction

Judging from the screenshots, the intercepting device appeared to be from the same vendor as our company's own firewall, so we quickly organized a simulation with our network colleagues:

Enable the "Disable access to websites/software downloads" policy on the firewall for a specific terminal IP.

Then request https://app.dewu.com in the browser and observe that the policy is hit.

At the same time, capture packets on the client and at the firewall egress:

Computer client side:

Firewall egress side:

Based on the above chain of evidence, it can be basically confirmed that the firewall policy misclassified the domain name of the company's Dewu App as a download website.

2.7 Manufacturer Communication

After we reported the reproduced problem to the vendor, the vendor's policy engineer confirmed that there was a bug in the "access website/software download" policy. During the communication it was also confirmed that the airline and our company were using the same vendor's firewall.

2.8 Progress Synchronization

On April 18, the vendor released the updated policy network-wide.

On April 19, our AC equipment automatically updated its policy.

On April 21, we asked a friend to verify on the same flight that the Dewu App worked smoothly, and the verification passed.

3. Review of network technology points

3.1 traceroute

From this traceroute data we can determine one thing:

  • The latency from airborne WiFi to the main domain names of domestic e-commerce companies is generally above 600ms. This data further confirms that airborne WiFi uses the SATCOM method (high latency).

3.2 IP Header

  • The TTL value in the IP header decreases by 1 each time the packet passes through a router, and stops changing once the packet reaches the receiver. Within one connection the path is normally stable, so the TTL seen in packets from the same peer should stay the same from packet to packet. If it changes significantly, the packet was most likely forged or tampered with by an intermediate device.
  • Another field in the IP header, Identification, identifies an IP datagram. If a packet has to be fragmented because it exceeds the MTU (typically 1500 bytes), every fragment carries the same Identification value. RFC 791 does not specify how the value is chosen, but in practice we see it increase sequentially (by 1 for each packet sent). If there is a large jump, the packet was most likely forged or tampered with by an intermediate device.
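Both checks start from reading the TTL and Identification fields out of the 20-byte IPv4 header. The sketch below parses a header and estimates the hop count from the TTL. The sample header is synthetic, and the list of initial TTL values (64, 128, 255) is an assumption based on common operating system and device defaults, not something taken from the captures.

```python
import socket
import struct


def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header."""
    (ver_ihl, _tos, total_len, ident, flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,
        "total_len": total_len,
        "identification": ident,
        "dont_fragment": bool(flags_frag & 0x4000),    # DF bit
        "more_fragments": bool(flags_frag & 0x2000),   # MF bit
        "fragment_offset": (flags_frag & 0x1FFF) * 8,  # in bytes
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


def estimated_hops(observed_ttl: int, initial_ttls=(64, 128, 255)) -> int:
    """Guess the hop count, assuming the sender used a common initial TTL."""
    initial = min(t for t in initial_ttls if t >= observed_ttl)
    return initial - observed_ttl


# Synthetic header: DF set, TTL 52, TCP, documentation addresses.
sample_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 0x1C46, 0x4000, 52, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.10"),
)
fields = parse_ipv4_header(sample_header)
hops = estimated_hops(fields["ttl"])
print(fields["ttl"], hex(fields["identification"]), fields["dont_fragment"], hops)
```

Comparing these fields across consecutive packets from the same peer is what exposes the jumps described above.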

3.3 AC Equipment Network Management

AC (Access Controller) is a centrally controlled network device used to manage multiple APs (Access Points) and the Internet access behavior of the users connected through them. Its Internet behavior management functions help administrators monitor and manage users' Internet access, including the following aspects:

  • Authentication and authorization management: AC devices can implement multiple authentication methods, such as wireless LAN security protocols (WPA, WPA2), 802.1X authentication, etc., to ensure the user's legal identity and authorize them to access the network.
  • Traffic control: AC devices can control the traffic of connected users, limiting their upstream and downstream bandwidth, total traffic volume and other parameters to avoid network congestion and a single user occupying too much bandwidth resources.
  • Internet behavior filtering: AC devices can filter users' network access, prohibiting users from accessing certain inappropriate websites or applications, and ensuring the security and stability of the network.
  • User management and monitoring: AC devices can uniformly manage and monitor all access users, including user identity, device type, MAC address, IP address, Internet access time, etc., to help administrators understand users' Internet habits and behavioral characteristics, and discover and handle abnormal behaviors.

Functionality:

User authentication: user logins can be managed by user and user group, with local authentication, AAA authentication, and so on.

URL filtering: HTTP identification technology reads the Host field of the HTTP request to learn which website the user wants to visit, so that websites can be filtered (the problem this time was in this function).

  1. HTTP: after the three-way handshake, the client sends an HTTP request containing a Host field, which reveals the website being visited.

  2. HTTPS: after the three-way handshake, an SSL/TLS encrypted channel is established. The Client Hello message that the client sends at the start of the SSL/TLS handshake contains a server name field (SNI), which can be used for the corresponding filtering.
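Both identification paths can be sketched in a few dozen lines. The code below extracts the Host header from a plain HTTP request and the SNI from a TLS Client Hello. It is an illustration under stated assumptions, not the vendor's implementation: the Client Hello is a minimal synthetic message built by a helper of our own, and the blocked-host set is a hypothetical policy.

```python
import struct
from typing import Optional


def http_host(request: bytes) -> Optional[str]:
    """Return the Host header of a plain-text HTTP/1.x request, if present."""
    head = request.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:          # skip the request line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"host":
            return value.strip().decode("ascii")
    return None


def build_client_hello(server_name: str) -> bytes:
    """Build a minimal synthetic Client Hello carrying only an SNI extension."""
    name = server_name.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name         # name_type 0 = host_name
    sni_data = struct.pack("!H", len(entry)) + entry              # server_name_list
    extensions = struct.pack("!HH", 0, len(sni_data)) + sni_data  # extension type 0
    body = (
        b"\x03\x03"                             # client_version
        + bytes(32)                             # random (zeroed for the sketch)
        + b"\x00"                               # empty session_id
        + struct.pack("!H", 2) + b"\x13\x01"    # one cipher suite
        + b"\x01\x00"                           # one compression method: null
        + struct.pack("!H", len(extensions)) + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body     # type 1 = Client Hello
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


def tls_sni(record: bytes) -> Optional[str]:
    """Return the SNI host name from a Client Hello record, if present."""
    if len(record) < 9 or record[0] != 0x16 or record[5] != 0x01:
        return None
    pos = 5 + 4                                  # record header + handshake header
    pos += 2 + 32                                # client_version + random
    pos += 1 + record[pos]                       # session_id
    pos += 2 + int.from_bytes(record[pos:pos + 2], "big")  # cipher_suites
    pos += 1 + record[pos]                       # compression_methods
    end = pos + 2 + int.from_bytes(record[pos:pos + 2], "big")
    pos += 2
    while pos + 4 <= end:
        ext_type, ext_len = struct.unpack("!HH", record[pos:pos + 4])
        pos += 4
        if ext_type == 0:                        # server_name extension
            # Skip list length (2 bytes) and name_type (1 byte).
            name_len = int.from_bytes(record[pos + 3:pos + 5], "big")
            return record[pos + 5:pos + 5 + name_len].decode("ascii")
        pos += ext_len
    return None


blocked_hosts = {"app.dewu.com"}                 # hypothetical misclassified entry
http_name = http_host(b"GET / HTTP/1.1\r\nHost: m.dewu.com\r\nAccept: */*\r\n\r\n")
https_name = tls_sni(build_client_hello("app.dewu.com"))
print(http_name, http_name in blocked_hosts)
print(https_name, https_name in blocked_hosts)
```

Because the SNI is sent in clear text before encryption starts, a filtering device can apply a domain policy to HTTPS traffic without decrypting it, which is why a wrong category for one domain was enough to break the App.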

References:

1) China Civil Aviation Network

2) China Eastern Airlines official website

3) Communication World

4) ZTE 5G Ground-to-Air Communications White Paper 2020
