Optimization practice of RUDP transmission for real-time 1V1 Q&A with a global latency of 250 milliseconds

Recently I have talked with many friends in the real-time audio and video field about RUDP (Reliable UDP). It is a commonplace topic: RUDP is used in many well-known projects, such as Google's QUIC and webRTC. Some friends think that building a reliability layer on top of UDP is itself unreliable, while others regard it as a killer technology that can solve most problems in the real-time field. As an education company, Xuebajun does use RUDP technology to solve problems in many real-time scenarios, and the RUDP methods we use differ from scenario to scenario. Let's first look at the scenarios in which Xuebajun uses RUDP:

  1. The real-time 1V1 Q&A system with a global latency of 250 milliseconds uses an RUDP + multi-point relay intelligent routing solution.
  2. The 500-millisecond 1080P video interactive system uses an RUDP + PROXY scheduling transmission solution.
  3. The 6-party real-time synchronous writing system uses RUDP + redo log reliable transmission technology.
  4. The 720P Pad screen-mirroring system under weak WIFI networks uses RUDP + GCC real-time flow control technology.
  5. The P2P distribution system for large-scale live broadcasts saves more than 75% of distribution bandwidth through RUDP + multi-point parallel relay technology.

When it comes to real-time transmission, we naturally think of RUDP first. RUDP is used throughout Xuebajun's core transmission systems, but we have designed different RUDP mechanisms for different scenarios. Based on those heated discussions and our own experience, let me take a closer look at RUDP. In the field of real-time communication there is a triangular balance: the constraint relationship between cost, quality, and latency (Figure 1).

Figure 1

That is to say, there is a triangular constraint relationship (LEQ: Latency, Expense, Quality) between the cost invested, the quality obtained, and the communication latency, so the designer of a real-time communication system must find a balance point under these three constraints. TCP guarantees quality by increasing latency and transmission cost; UDP guarantees latency and cost by sacrificing quality. RUDP makes it easier to find such a balance point in certain specific scenarios. How does RUDP find it? To answer that, we must first analyze RUDP's concept of reliability and its usage scenarios.

The concept of reliability

In the real-time communication process, different demand scenarios have different requirements for reliability. We summarize them into three categories here:

  • Best-effort reliability: The receiver of the communication wants the sender's data to arrive as completely as possible, but the business itself can tolerate some data loss. For example: audio and video data, idempotent status data.

  • Unordered reliability: The receiver of the communication requires that the sender's data must arrive completely, but the order of arrival can be ignored. For example: file transfer, whiteboard writing, real-time graphic drawing data, log-type additional data, etc.

  • Ordered reliability: The communication receiver requires that the sender's data must arrive in order and in full.

RUDP determines its communication model and mechanism based on these three types of requirements and the triangular constraint relationship in Figure 1, that is, to find the balance point of communication.

Why make UDP reliable?

At this point many people may say: why bother, just use TCP! Indeed, many people do exactly that. TCP is a reliable communication protocol designed around fairness, but under some harsh network conditions it either cannot provide normal communication quality or does so at too high a cost. The reason for building reliability guarantees on top of UDP is to reduce cost as much as possible while ensuring communication latency and quality. RUDP mainly addresses the following problems:

  • End-to-end connectivity issues: Generally, direct communication between terminals involves NAT traversal. TCP traversal is very difficult to implement, but UDP traversal is much simpler. If it is end-to-end reliable communication, RUDP is generally used to solve it. The scenarios include: end-to-end file transfer, audio and video transmission, interactive command transmission, etc.

  • Transmission issues in weak network environments: In some WIFI or 3G/4G mobile networks, low-latency reliable communication is required. If TCP communication is used, the delay may be very large, which will affect the user experience. For example: real-time operation online game communication, voice conversation, multi-party whiteboard writing, etc. These scenarios can use a special RUDP method to solve such problems.

  • Bandwidth competition problem: Sometimes client data upload needs to break through the fairness limitations of TCP itself to achieve high speed, low latency and stability. That is to say, a special flow control algorithm must be used to squeeze the client upload bandwidth. For example, in live audio and video streaming, RUDP can be used to squeeze bandwidth and increase communication stability, avoiding frequent disconnections and reconnections similar to TCP.

  • Transmission path optimization problem: In some latency-sensitive scenarios, application-layer relays are used to optimize the transmission route, that is, dynamic intelligent routing. Both communicating parties transmit over RUDP, and the intermediate relay routing is chosen to minimize latency. There is also a class of throughput-oriented scenarios, such as data distribution and data backup between services, which generally use multi-point parallel relays to increase transmission speed, also on top of RUDP (both are described in detail later).

  • Resource optimization problem: In some scenarios, in order to avoid the three-way handshake and four-way handshake process of TCP, RUDP is used to optimize resource occupancy and response time and improve the concurrency of the system, such as QUIC.

Whatever the scenario, reliability (that is, quality) must be guaranteed. So how is reliability achieved over UDP? The answer is retransmission.

Retransmission modes

The IP protocol was not designed for reliable data delivery, so UDP has to rely on retransmission to achieve reliability; this is RUDP in the usual sense. Before describing RUDP retransmission, let's first look at the basic framework of RUDP, as shown in the figure:

Figure 2

RUDP is divided into a sending end and a receiving end. Every RUDP implementation makes different choices and simplifications in its design, which can be summarized as the units in the figure. RUDP retransmission means that the sender retransmits data based on the packet-loss feedback carried in the receiver's ACKs, and it designs its own retransmission method according to the scenario. Retransmission methods fall into three categories: timed retransmission, request retransmission, and FEC selective retransmission.

Timed retransmission

Timed retransmission is easy to understand: if the sender has not received an ACK for a data packet within one RTO of the time it was sent (T1), it retransmits the packet. This method depends on the receiver's ACKs and on the RTO, and it is prone to misjudgment in two main situations:

  • The other side received the data packet, but the ACK was lost in transit.

  • The ACK is still in transit, but more than one RTO has already elapsed at the sender.

Therefore, the timed retransmission method mainly focuses on the calculation of the RTO. If your scenario is latency-sensitive and traffic cost is not a major concern, you can make the calculated RTO relatively small so that latency stays as low as possible. For example, real-time online games and synchronous writing in education are typical scenarios that trade Expense for Latency and Quality, and timed retransmission suits such low-latency, low-bandwidth transmission. For high-bandwidth real-time transmission, however, the bandwidth consumed by timed retransmission is substantial; in extreme cases the retransmission rate can reach 20%, so high-bandwidth modes generally use request retransmission instead.
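To make the timed mode concrete, here is a minimal C sketch; rudp_packet_t, rudp_send_raw, and the flat window array are illustrative names for this article, not from any particular RUDP implementation:

    /* Timed retransmission: resend any unacked packet one RTO after
       its last send time (T1). */
    #include <stdint.h>
    #include <stddef.h>

    void rudp_send_raw(const uint8_t *data, size_t size);  /* hypothetical send primitive */

    typedef struct {
        uint32_t seq;        /* sequence number */
        int64_t  send_ts;    /* time of last (re)transmission, ms */
        int      acked;      /* set once an ACK has been received */
        uint8_t  data[1400];
        size_t   size;
    } rudp_packet_t;

    /* Scan the in-flight window: any packet still unacked after one RTO
       is retransmitted and its timer restarts. */
    void check_timed_retransmit(rudp_packet_t *win, size_t n,
                                int64_t now_ms, int64_t rto_ms)
    {
        for (size_t i = 0; i < n; i++) {
            if (!win[i].acked && now_ms - win[i].send_ts >= rto_ms) {
                rudp_send_raw(win[i].data, win[i].size);
                win[i].send_ts = now_ms;
            }
        }
    }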

Request retransmission

Request retransmission means the receiver carries feedback about its lost packets when it sends an ACK; when the sender receives this ACK, it retransmits the packets according to the loss feedback, as shown in the following figure:

Figure 3

The most critical step in this feedback loop is deciding which loss information an ACK should carry, because UDP packets can be reordered and jittered in the network. The receiver needs to estimate the network's jitter during the communication, that is, rtt_var (the RTT variance). When a gap is found, a time t1 is recorded; only when t1 + rtt_var < curr_t (the current time) is the packet considered lost, at which point subsequent ACKs carry this loss information and the report time t2 is updated. The loss queue is scanned continuously, and whenever t2 + RTO < curr_t, the loss information is carried in an ACK again, and so on until the packet arrives.

This is retransmission driven by loss requests. If the network is very bad, the receiver will keep initiating retransmission requests and the sender will keep retransmitting, causing a network storm and degrading communication quality. We therefore design a congestion control module at the sending end to limit the flow, which we will focus on later. Besides network storms, the whole request retransmission mechanism depends on two time parameters, the jitter time and the RTO, whose estimation and tuning are closely tied to the transmission scenario. Request retransmission has a higher latency than timed retransmission and is generally suitable for higher-bandwidth scenarios such as video, file transfer, and data synchronization.
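A minimal sketch of the receiver-side loss tracking just described, with illustrative names; it assumes other code creates a loss entry when a sequence gap is first observed and removes it when the packet finally arrives:

    /* Receiver-side loss queue: report a gap once the jitter window
       (rtt_var) has passed, then re-report it every RTO. */
    #include <stdint.h>

    typedef struct {
        uint32_t seq;        /* missing sequence number */
        int64_t  detect_ts;  /* t1: when the gap was first observed, ms */
        int64_t  report_ts;  /* t2: when it was last carried in an ACK, ms */
        int      reported;   /* carried in an ACK at least once? */
    } loss_rec_t;

    /* Collect the sequence numbers the next ACK should carry. */
    int collect_nacks(loss_rec_t *q, int n, int64_t now,
                      int64_t rtt_var, int64_t rto,
                      uint32_t *out, int out_cap)
    {
        int cnt = 0;
        for (int i = 0; i < n && cnt < out_cap; i++) {
            if (!q[i].reported) {
                if (q[i].detect_ts + rtt_var < now) {  /* jitter window passed */
                    q[i].reported  = 1;
                    q[i].report_ts = now;
                    out[cnt++] = q[i].seq;
                }
            } else if (q[i].report_ts + rto < now) {   /* re-request every RTO */
                q[i].report_ts = now;
                out[cnt++] = q[i].seq;
            }
        }
        return cnt;
    }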

FEC selective retransmission

In addition to the timed and request retransmission modes, there is a third way: selective retransmission combined with FEC grouping. FEC (Forward Error Correction) is a forward error correction technique, generally implemented with an XOR-like algorithm; there are also multi-layer FEC algorithms and Raptor fountain-code techniques, all of which essentially amount to solving a system of equations. A schematic of FEC applied to RUDP is as follows:

Figure 4

When the sender sends packets, it groups several of them according to the FEC scheme, derives a number of redundant packets via XOR, and sends everything to the receiver together. If the receiver finds packets lost but recoverable through the FEC group, it does not request retransmission from the sender; if packets in the group cannot be recovered by FEC, it requests the original packets from the sender. FEC grouping suits latency-sensitive transmission with random packet loss. When bandwidth is not plentiful, however, FEC's redundant packets may make the network even worse. The FEC method can be combined with either the request retransmission mode or the timed retransmission mode.
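Below is a minimal single-parity XOR example, assuming fixed-size packets; real systems (multi-layer FEC, Raptor codes) are far more elaborate, and a single XOR parity packet can only recover one loss per group:

    /* Single-parity XOR FEC over a group of k fixed-size packets. */
    #include <stdint.h>
    #include <string.h>

    #define PKT_SIZE 1200

    /* Build the redundant packet: parity = p0 ^ p1 ^ ... ^ p(k-1). */
    void fec_encode(const uint8_t pkts[][PKT_SIZE], int k,
                    uint8_t parity[PKT_SIZE])
    {
        memset(parity, 0, PKT_SIZE);
        for (int i = 0; i < k; i++)
            for (int j = 0; j < PKT_SIZE; j++)
                parity[j] ^= pkts[i][j];
    }

    /* Recover the single missing packet by XOR-ing the parity with all
       received packets of the group; with two or more losses the group
       cannot be repaired and a retransmission request is needed. */
    void fec_recover(const uint8_t pkts[][PKT_SIZE], const int received[],
                     int k, const uint8_t parity[PKT_SIZE],
                     uint8_t missing[PKT_SIZE])
    {
        memcpy(missing, parity, PKT_SIZE);
        for (int i = 0; i < k; i++)
            if (received[i])
                for (int j = 0; j < PKT_SIZE; j++)
                    missing[j] ^= pkts[i][j];
    }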

Calculation of RTT and RTO

When introducing the retransmission modes above, time measures such as RTT and RTO were mentioned many times. RTT (Round-Trip Time) is the network round-trip delay, calculated from sending a data packet and receiving its ACK packet. The schematic is as follows:

Figure 5

RTT = T2 - T1. This only calculates the RTT of a single packet exchange, but the network fluctuates, which inevitably introduces noise. Therefore, a weighted average convergence method is introduced into the calculation (for details, please refer to RFC 793).

SRTT = (α × SRTT) + (1 - α) × RTT. In this way, the new smoothed approximation SRTT is obtained; α is generally 0.8. After determining SRTT, the next step is to calculate the variance. For a single sample we set RTT_VAR = |SRTT - RTT|,

and then SRTT_VAR = (α × SRTT_VAR) + (1 - α) × RTT_VAR, which gives the smoothed variance. What we ultimately need, though, is the RTO, because it drives packet retransmission: the RTO is the retransmission period of a packet. From the communication process it is easy to see that if no ACK has been received within SRTT + SRTT_VAR of sending a packet, the packet can be retransmitted, that is:

RTO = SRTT + SRTT_VAR

However, on networks with severe jitter this still leads to a high rate of spurious retransmissions, so in practice:

RTO = β × (SRTT + SRTT_VAR)

where 1.2 < β < 2.0, and the value of β can be chosen according to the transmission scenario.
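The formulas above transcribe directly into code. The following sketch uses α = 0.8 as stated; β = 1.5 is an illustrative pick from the 1.2 to 2.0 range:

    /* Smoothed RTT / RTO estimator per the formulas above. */
    #define ALPHA 0.8
    #define BETA  1.5   /* pick 1.2 < β < 2.0 per scenario */

    typedef struct {
        double srtt;      /* smoothed RTT */
        double srtt_var;  /* smoothed RTT variance */
        int    inited;
    } rtt_est_t;

    void rtt_update(rtt_est_t *e, double rtt /* = T2 - T1 */)
    {
        if (!e->inited) {                 /* first sample seeds the estimator */
            e->srtt     = rtt;
            e->srtt_var = rtt / 2;
            e->inited   = 1;
            return;
        }
        double rtt_var = e->srtt - rtt;   /* RTT_VAR = |SRTT - RTT| */
        if (rtt_var < 0)
            rtt_var = -rtt_var;
        e->srtt     = ALPHA * e->srtt     + (1 - ALPHA) * rtt;
        e->srtt_var = ALPHA * e->srtt_var + (1 - ALPHA) * rtt_var;
    }

    double rto(const rtt_est_t *e)
    {
        return BETA * (e->srtt + e->srtt_var);  /* RTO = β × (SRTT + SRTT_VAR) */
    }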

RUDP achieves reliability through retransmission. In the triangular balance, retransmission trades Expense and Latency for Quality, so it causes two problems: extra latency and extra retransmission bandwidth. The latter in particular, if poorly controlled, can cause network storms, so a window congestion mechanism is designed at the sending end to avoid excessive concurrent bandwidth usage.

Window and congestion control

  • Window

RUDP needs sending and receiving sliding windows that cooperate with a congestion algorithm for flow control. Some RUDP implementations require the sender's and receiver's windows to correspond strictly, while others do not. Ordered reliable RUDP requires the receiver to sort and buffer within its window; in unordered reliable or best-effort reliable scenarios, the receiver generally does not buffer in the window but only slides its position. Let's first look at the relationship between the sender and receiver windows:

Figure 6

The figure above shows the sender sending 6 data packets from its send window. When the receiver receives 101, 102, 103, and 106, it first checks the continuity of the packets, slides its window start position past 103, and then responds to each packet with an ACK. When the sender receives the ACKs, it too confirms continuity and slides its window past 103, then checks the vacancy in the window and fills it with new data to send. That is the entire window sliding process.

It is worth noting what happens when the receiver gets 106: under ordered reliability, 106 is not delivered to the upper-layer business but waits for 104 and 105; under best-effort or unordered reliability, 106 is delivered to the upper layer immediately. After ACKs are received, how far the sender's window slides is decided by its congestion mechanism; in other words, the window's sliding speed is controlled by congestion control. Congestion control is implemented based either on packet loss rate or on the communication delay between the two parties. A sketch of the sender-side slide follows; after it, let's look at several typical congestion control schemes.
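Here is a minimal sketch of the sender-side slide in Figure 6, with illustrative names: on each ACK the packet is marked, and the window base advances past every contiguously acknowledged sequence (sequence wraparound handling is omitted for brevity):

    /* Sender-side sliding window: mark ACKed packets and slide the base. */
    #include <stdint.h>

    #define WND 32

    typedef struct {
        uint32_t base_seq;     /* lowest unacknowledged sequence */
        int      acked[WND];   /* acked[i] covers base_seq + i */
    } send_wnd_t;

    void on_window_ack(send_wnd_t *w, uint32_t seq)
    {
        if (seq < w->base_seq || seq >= w->base_seq + WND)
            return;                       /* outside window: stale or bogus */
        w->acked[seq - w->base_seq] = 1;

        /* Slide over the contiguous prefix, e.g. with 101, 102, 103
           acknowledged the base moves past 103; the freed slots can
           then be filled with new data. */
        while (w->acked[0]) {
            for (int i = 0; i < WND - 1; i++)
                w->acked[i] = w->acked[i + 1];
            w->acked[WND - 1] = 0;
            w->base_seq++;
        }
    }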

  • Classic congestion algorithm

The classic TCP congestion algorithm has four parts: slow start, congestion avoidance, congestion handling, and fast recovery. Together they determine the send window and send rate; essentially, they judge the network's congestion state from packet loss under current conditions in order to pick a suitable transmission window. The classic algorithm works with timed retransmission; an RUDP that adopts it generally aims at ordered reliable transmission while respecting the fairness principle of network transmission. Let's walk through these parts one by one.

  • Slow start

When a connection is just established, cwnd cannot be set to a large value right away, since that could easily cause massive retransmission. In the classic algorithm, cwnd starts at 1 and is then gradually increased according to packet loss observed during communication, adapting to the current network state until the slow start threshold (ssthresh) is reached. The steps are as follows:

1) Initialize cwnd = 1 and start transmitting data.

2) On each feedback ACK received, increase cwnd by 1.

3) When a full RTT passes with no packet loss or retransmission, set cwnd = cwnd * 2.

4) When cwnd >= ssthresh, or packet loss and retransmission occur, slow start ends and the connection enters the congestion avoidance state.

  • Congestion Avoidance

When slow start ends, the connection may not yet have reached the network's transmission ceiling, so a slower adjustment process adapts further: generally, if no packet loss is seen within one RTT, cwnd = cwnd + 1. Once packet loss and timeout retransmission occur, the connection enters the congestion handling state.

  • Congestion handling

Congestion handling in TCP is very blunt: on packet loss and retransmission, cwnd = cwnd / 2 is set directly, and the connection then enters the fast recovery state.

  • Fast recovery

Fast recovery decides whether to recover quickly by confirming that loss occurred at only one position in the window. As in Figure 6, if only 104 is lost while 105 and 106 arrive, then every ACK keeps its base at 103. If three consecutive ACKs with base 103 are received, fast recovery is triggered: 104 is retransmitted immediately. Then, once a new ACK arrives with base > 103, set cwnd = cwnd + 1 and re-enter the congestion avoidance state. A compact sketch of the four phases follows.
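This is a simplification for illustration, not TCP's exact state machine; the names and the retransmit helper are hypothetical, and cwnd/ssthresh are counted in packets:

    /* Classic congestion control: slow start, congestion avoidance,
       congestion handling, and fast recovery. */
    void retransmit_lost_packet(void);   /* hypothetical: resend e.g. 104 */

    typedef struct {
        int cwnd, ssthresh, dup_acks;
    } cc_t;

    void on_rtt_without_loss(cc_t *c)
    {
        if (c->cwnd < c->ssthresh)
            c->cwnd *= 2;          /* slow start: double every loss-free RTT */
        else
            c->cwnd += 1;          /* congestion avoidance: +1 per RTT */
    }

    void on_loss_timeout(cc_t *c)
    {
        c->cwnd /= 2;              /* congestion handling: halve the window */
        if (c->cwnd < 1)
            c->cwnd = 1;
        c->dup_acks = 0;
    }

    void on_cc_ack(cc_t *c, int base_advanced)
    {
        if (base_advanced) {       /* new ACK past the hole: recovery done */
            c->cwnd += 1;
            c->dup_acks = 0;       /* back to congestion avoidance */
        } else if (++c->dup_acks == 3) {
            retransmit_lost_packet();  /* 3 ACKs with the same base */
            c->dup_acks = 0;
        }
    }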

Classic congestion control is designed around loss detection and timed retransmission. In the triangular balance, it is a typical case of trading Latency for Quality; and since its fairness-oriented design avoids excessive Expense, this transmission method struggles to squeeze network bandwidth and to guarantee both high throughput and low latency.

  • BBR congestion algorithm

To address the classic algorithm's latency and bandwidth-squeezing problems, Google designed BBR, a congestion control algorithm based on sender-side delay and bandwidth estimation. This algorithm is dedicated to solving two problems:

  1. Make full use of bandwidth on a network transmission link with a certain packet loss rate
  2. Reduce buffer delay in network transmission

The main strategy of BBR is to estimate the link's min_rtt and max_bandwidth through periodic ACK and NACK feedback, and then to size the maximum in-flight window (cwnd) as the bandwidth-delay product:

cwnd = max_bandwidth × min_rtt

The transmission model is as follows:

Figure 7

The whole of BBR's congestion control is a process of probing bandwidth and pacing rate. It has four states:

Startup: startup state (equivalent to slow start), with gain parameter max_gain = 2.85

DRAIN: drains the queue built up during Startup

PROBE_BW: bandwidth probing state, increasing (1.25) or decreasing (0.75) cwnd by smaller BBR gain parameters

PROBE_RTT: delay probing state, sampling RTT while maintaining a minimal send window (4 MSS)

So how do these states switch back and forth? The following are the general steps of BBR in QUIC:

  1. When the connection is initialized, an initial cwnd = 8 is set and the state is set to Startup.
  2. Data is sent in Startup, and periodic sampling of ACK data determines whether the bandwidth can still be increased; if so, cwnd = cwnd * max_gain. If the preset startup period is exceeded or packet loss occurs, the state enters DRAIN.
  3. In the DRAIN state, if flight_size (the amount of data sent but not yet acknowledged) > cwnd, the state stays in DRAIN; once flight_size < cwnd, it enters PROBE_BW.
  4. In the PROBE_BW state, if no packet loss occurs and flight_size < cwnd * 1.25, the current cwnd is kept and the state returns to Startup; if no loss occurs but flight_size > cwnd, cwnd = cwnd * 1.25; if packet loss occurs, cwnd = cwnd * 0.75.
  5. In the Startup/DRAIN/PROBE_BW states, if no RTT sample <= min_rtt is observed for 10 seconds of communication, the state enters PROBE_RTT and cwnd = 4 * MSS.
  6. In the PROBE_RTT state, on each ACK received it checks that flight_size >= cwnd with no packet loss, takes the minimum RTT measured in this round as min_rtt, and returns to the Startup state (a sketch of these transitions follows).
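A rough sketch of these transitions, with constants taken from the text (max_gain = 2.85, probe gains 1.25 and 0.75, a 4 MSS floor, a 10-second min_rtt window); the sampling that feeds max_bw and min_rtt is omitted, and the names are illustrative rather than QUIC's actual code:

    /* BBR-like state machine driven by ACK feedback. */
    typedef enum { STARTUP, DRAIN, PROBE_BW, PROBE_RTT } bbr_state_t;

    typedef struct {
        bbr_state_t state;
        double      cwnd;        /* in packets */
        double      max_bw;      /* max sampled bandwidth (updated elsewhere) */
        double      min_rtt;     /* min sampled RTT, ms (updated elsewhere) */
        long long   min_rtt_ts;  /* when min_rtt was last refreshed, ms */
    } bbr_t;

    #define MAX_GAIN   2.85
    #define PROBE_UP   1.25
    #define PROBE_DOWN 0.75
    #define MIN_CWND   4         /* 4 * MSS, in packets */

    void bbr_on_ack(bbr_t *b, double flight, int lost, long long now)
    {
        if (b->state != PROBE_RTT && now - b->min_rtt_ts > 10000) {
            b->state = PROBE_RTT;           /* min_rtt stale for 10 s */
            b->cwnd  = MIN_CWND;
            return;
        }
        switch (b->state) {
        case STARTUP:
            if (lost) b->state = DRAIN;     /* startup timeout also applies */
            else      b->cwnd *= MAX_GAIN;  /* ramp while bandwidth grows */
            break;
        case DRAIN:
            if (flight < b->cwnd) b->state = PROBE_BW;
            break;
        case PROBE_BW:
            if (lost)                  b->cwnd *= PROBE_DOWN;
            else if (flight > b->cwnd) b->cwnd *= PROBE_UP;
            else                       b->state = STARTUP;
            break;
        case PROBE_RTT:
            if (flight >= b->cwnd && !lost) {
                b->cwnd  = b->max_bw * b->min_rtt;  /* BDP = bw × min_rtt */
                b->state = STARTUP;
            }
            break;
        }
        if (b->cwnd < MIN_CWND) b->cwnd = MIN_CWND;
    }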

BBR uses the steps above to periodically estimate cwnd, the product of maximum throughput and minimum delay, and then sets the sender's bit rate via the pacing rate, achieving congestion control. BBR suits networks with random loss but otherwise stable conditions; on WIFI or 4G with highly unstable signal, it is prone to flooding the network and to inaccurate prediction. On multi-connection fairness, BBR also lets small-RTT connections consume more bandwidth than large-RTT connections, which can make large-RTT connections too slow. In the triangular balance, the BBR congestion algorithm is a case of spending Expense in exchange for Latency and Quality.

  • webRTC gcc

When it comes to audio and video transmission, the webRTC system inevitably comes to mind. webRTC implements a congestion control algorithm (gcc) for video transmission. webRTC's gcc combines sender-side packet-loss statistics with receiver-side delay and bandwidth statistics. It is a best-effort reliable transport: if a packet has been retransmitted too many times, it is simply discarded, which fits the video transmission scenario. Let's borrow a picture from weizhenwei to see how it works:

Figure 8

The gcc sender paces its rate based on the packet loss rate and a lookup table: when loss < 2%, the transmission bandwidth is increased; when 2% <= loss < 10%, the current bit rate is maintained; when loss >= 10%, the link is considered overloaded and the transmission bandwidth is reduced.
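In code, the sender-side table looks roughly like this; the 1.05 increase factor and the (1 - 0.5 × loss) backoff follow the commonly cited gcc loss controller, but treat the exact constants as illustrative:

    /* Loss-based rate adjustment per the gcc sender table above. */
    double gcc_adjust_bitrate(double bitrate, double loss /* 0.0 - 1.0 */)
    {
        if (loss < 0.02)
            return bitrate * 1.05;           /* under-loaded: probe upward */
        if (loss < 0.10)
            return bitrate;                  /* hold the current bit rate */
        return bitrate * (1 - 0.5 * loss);   /* overloaded: cut the rate */
    }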

The gcc receiver runs a Kalman filter over the arrival delay variance and packet sizes to converge on a bandwidth estimate. The details are not covered here; please refer to http://www.jianshu.com/p/bb34995c549a

It is worth mentioning that gcc's receiver-side Kalman-filter bandwidth estimation is a very novel congestion control idea and a good reference for anyone implementing a reliable RUDP transmission system. The algorithm does have a flaw: under intermittent packet loss, gcc may converge slowly, the REMB message may struggle to feed back to the sender in time, and the sender's flow control can fail. In the triangular balance, gcc is a case of trading Quality and Expense for Latency.

  • Weak window congestion mechanism

In fact, many scenarios need no congestion control, or only a very weak form of it, for example synchronized writing between teachers and students, or real-time games. Because the amount of data transmitted is small, it is enough to guarantee sufficiently low latency and reliability, so a fixed window size is generally used for flow control. In our system we generally use a window of cwnd = 32, and the ACK feeds back the status of the entire receive window to the sender. It is simple and direct, and it adapts easily to weak network environments, as the sketch below shows.
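A minimal sketch of this weak-window scheme, assuming a fixed cwnd of 32 and an ACK that carries the whole receive window's status as a 32-bit bitmap (the bitmap format is illustrative):

    /* Weak-window flow control: fixed cwnd, whole-window ACK bitmap. */
    #include <stdint.h>

    #define FIXED_CWND 32

    /* Sender: gate transmission purely on the in-flight count. */
    int can_send(int flight_size)
    {
        return flight_size < FIXED_CWND;
    }

    /* Receiver: describe the whole window in one ACK
       (bit i set = packet base + i received). */
    uint32_t build_ack_bitmap(const int received[FIXED_CWND])
    {
        uint32_t bm = 0;
        for (int i = 0; i < FIXED_CWND; i++)
            if (received[i])
                bm |= (1u << i);
        return bm;
    }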

Transmission Path

In addition to optimizing connections, squeezing bandwidth, and adapting to weak networks, RUDP inherits UDP's natural flexibility and can optimize transmission over the intermediate application-layer links. This is generally divided into multi-point serial optimization and multi-point parallel optimization. Let's discuss each in detail.

  • Multi-point serial relay

In real-time communication, some business scenarios are very latency-sensitive, such as real-time voice, synchronized writing, real-time interaction, and live broadcast with co-anchoring. A simple server relay or P2P connection can hardly meet their needs, especially across large physical distances. To solve this, Skype pioneered the global RTN (Real-time Multipoint Transmission Network), which in essence performs dynamic, intelligent routing through several relay nodes between the two communicating parties. This transmission style suits RUDP very well: we only need an RUDP channel between the two parties, and the intermediate links are just a collection of stateless relay caches. Route probing and selection among the relays provide high availability and low latency for the link, as shown in the following figure:

Figure 9

Using RUDP with multiple serial relays to optimize transmission is, in the triangular balance, a typical case of exchanging Expense for Latency.

  • Multi-point parallel relay

When transmitting or distributing media data between services, the transmission path must be highly available and bandwidth concurrency must be improved. This kind of scenario also builds an RUDP channel between the two transmitting parties and solves the problem by connecting multiple relay nodes in parallel, as shown in the following figure:

Figure 10

This model requires a multi-point routing-table probing mechanism at the sending end to determine the sending ratio and availability of each path. Besides link backup and higher concurrent transmission bandwidth, the model has an auxiliary benefit: a streaming media distribution system normally relays over BGP bandwidth, so letting nodes connect directly to each other where possible reduces BGP bandwidth usage and therefore cost.

Postscript

That concludes this introduction to RUDP. I have covered some details and scenarios, so consider it an entry-level overview. Almost 20 years have passed since the concept of RUDP was proposed, and many practitioners hope to design a general-purpose RUDP through one complete solution. I personally think that is unlikely; even if one were designed, it would probably end up resembling today's TCP, and there would be little point in that. The value of RUDP lies in choosing different techniques for different transmission scenarios: perhaps a relaxed congestion mode, perhaps a specific retransmission mode. Whatever the choice, it comes down to trade-offs among Expense, Latency, and Quality. By combining the scenario with this triangular balance, RUDP may help developers find a better solution.

【About the Author】

Yuan Rongxi, senior architect at Xuebajun, a C programmer of 16 years and Golang enthusiast, is good at building high-performance service systems, system performance tuning, troubleshooting, and debugging. In his early years he was obsessed with P2P communication networks, the TCP/IP protocol stack, and authentication and encryption technologies, and once implemented a real-time video transmission system based on P2P super-node technology. He joined Xuebajun in 2015 and is responsible for building Xuebajun's intelligent-routing real-time audio and video transmission system and network to solve the real-time problems of audio and video communication. He focuses on storage systems and concurrent programming, is interested in the Paxos and Raft distributed protocols, particularly likes database kernels and storage engines, and keeps exploring the implementation and transaction models of MySQL/InnoDB and WiredTiger. He is keen on open source and has contributed patches to the open source community. In his spare time he likes to write long technical articles and Tang poetry.
