Snapchat QUIC Practice: A Small Protocol Solves Big Problems

Readers who follow Internet technology will know that HTTP3, the next generation of the application-layer protocol, brings a major change: the underlying transport protocol switches from TCP to UDP, with Google's open source QUIC protocol sitting in between as the middle layer. UDP by itself is unordered and unreliable, but it is lightweight and does not force independent transfers to wait on each other, which can greatly improve transmission performance. The reliability of transmission under HTTP3 must therefore be guaranteed by QUIC. QUIC has gradually moved from early experiments to real-world use: Google, Facebook, Amazon, Cloudflare and others have deployed it in production. In this article, we look at Snapchat's QUIC practice.

Status quo

Snapchat is a photo-sharing application created by Stanford students. Users can take photos, record videos, add text and pictures, and share them with friends and followers over the Internet. Service performance, especially network performance, is critical: any delay when sharing content has a large impact on the experience.

According to their analysis, UI rendering and persisting data to disk complete within a few milliseconds in Snapchat's service path. The bottleneck is the network, where a request may take several seconds, error rates are high, and device hardware imposes tight constraints. To reduce network delays and errors, they apply conventional practices such as shrinking request and response payloads, removing unnecessary synchronization, and using global CDN providers for acceleration. They also adopted a newer transport technology: QUIC, the UDP-based next-generation Internet transport protocol.

Old architecture

Before QUIC, Snapchat clients used a conventional network stack. Take sending a Snap as an example: at the application layer, the Snap media is transmitted in an HTTP2 request; TLS secures the connection at the security layer; and TCP splits the request into segments and delivers them reliably to upload the Snap to the server. However, the TCP+TLS+HTTP2 stack is not ideal for mobile network environments. For example, if a Snapchatter switches between WiFi and a cellular network, in-flight TCP requests fail. For users chatting with friends, being unable to send messages because of a broken connection degrades the experience.
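
One cost of this layered stack is connection setup, because each layer runs its own handshake before any request data can flow. The following is a rough sketch of that arithmetic, not a measurement. The round-trip counts are general protocol properties (1 for the TCP handshake, 2 for a full TLS 1.2 handshake, 1 for TLS 1.3, and 1 or 0 for QUIC); the 100 ms round-trip time and all names in the code are assumptions chosen for illustration.

```python
# Round trips that must complete before the first byte of application data
# can be sent. These are general protocol properties, not Snapchat data.
HANDSHAKE_ROUND_TRIPS = {
    "TCP + TLS 1.2": 1 + 2,         # TCP handshake + full TLS 1.2 handshake
    "TCP + TLS 1.3": 1 + 1,         # TCP handshake + TLS 1.3 handshake
    "QUIC (first contact)": 1,      # transport and crypto handshake combined
    "QUIC (0-RTT resumption)": 0,   # data rides on the first flight
}


def setup_delay_ms(round_trips: int, rtt_ms: float) -> float:
    """Time spent on handshakes before the request itself can be sent."""
    return round_trips * rtt_ms


if __name__ == "__main__":
    rtt_ms = 100.0  # assumed mobile round-trip time, for illustration only
    for stack, trips in HANDSHAKE_ROUND_TRIPS.items():
        print(f"{stack:26s} {setup_delay_ms(trips, rtt_ms):6.0f} ms")
```

The point of the sketch is that setup cost scales with round-trip time, so users on slow or distant networks pay the most for every extra handshake.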

QUIC Advantages

QUIC is an Internet transport protocol developed by Google engineers. Built on UDP, it is the foundation of HTTP3 and replaces the TCP+TLS layers that sit beneath HTTP2. QUIC solves many transport-layer and application-layer problems while requiring almost no changes from application developers. As shown in the figure above, QUIC does not change the network-layer protocol, nor does it require changes to the higher-level HTTP protocol.

Compared with the TCP+TLS stack used by HTTP2, QUIC improves on the following aspects:

  • Faster connection establishment: QUIC supports zero-round-trip (0-RTT) handshakes, saving the 1-3 round trips that TCP+TLS needs before any payload can be sent.
  • Improved congestion control: QUIC has pluggable congestion control and provides richer information to the congestion control algorithm than TCP does. For example, it can run BBR v1 or BBR v2.
  • Multiplexing without head-of-line blocking: On an HTTP2 connection, when a TCP packet is lost, no stream on that connection can move forward until the packet is retransmitted and received by the remote end. This increases latency and can degrade the user experience on mobile networks. QUIC eliminates this stalling of other streams multiplexed on the same connection.
  • Connection migration across IP addresses: TCP requests fail if the client's IP address changes. A QUIC connection is identified by a connection ID randomly generated by the QUIC protocol layer (64 bits in Google's original QUIC) rather than by IP address and port, so clients can continue ongoing requests without interruption when the IP address changes.
  • Connection loss detection: QUIC can quickly detect a lost connection and avoid requests hanging for a long time.

These advantages of QUIC map well onto Snapchat's problems and greatly improve its user experience:

  • Faster connection establishment: Before QUIC, p90 connection establishment took up to 300 milliseconds. This setup delay is time the user spends waiting before Snaps arrive or messages can be viewed. QUIC's faster connections directly reduce that wait and improve the user experience.
  • Improved congestion control: Snap media uploaded by Snapchat can be larger than 10MB. Better congestion control algorithms improve throughput and reduce latency and error rates, especially for large media.
  • Multiplexing without head-of-line blocking: Snapchat has many short-form content use cases, including Snaps, Stories, Discover content, etc., and multiple download streams often share the same connection. QUIC eliminates HTTP2's head-of-line blocking, so that, for example, a message request no longer blocks a Spotlight request.
  • Connection migration across IP addresses: When you are with friends, being unable to send messages because the WiFi connection dropped degrades the experience. Connection migration solves this pain point.
  • Detecting connection loss: Long loading spinners caused by connection loss are disconcerting, especially when a Snapchatter is in full-screen mode enjoying content. With QUIC, the client can detect a request that failed due to connection loss and retry it while presenting a user-friendly UI.
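
The head-of-line blocking point above can be illustrated with a small simulation. This is a toy model under stated assumptions, not QUIC or TCP code: the packet layout, the stream names and the arrival order are all invented for the example. Four packets carry two streams, the first packet is lost and arrives last, and the two functions differ only in whether ordering is enforced across the whole connection or per stream.

```python
from collections import defaultdict

# Each packet: (connection-wide sequence number, stream name, per-stream offset).
# Hypothetical traffic: a message upload and a Spotlight download interleaved.
SENT = [
    (0, "message", 0),
    (1, "spotlight", 0),
    (2, "message", 1),
    (3, "spotlight", 1),
]

# Packet 0 is lost and only arrives after its retransmission.
ARRIVAL_ORDER = [1, 2, 3, 0]


def deliver_single_ordered_stream(arrivals):
    """TCP-like: data is released to the application only in connection order."""
    by_seq = {packet[0]: packet for packet in SENT}
    received, next_seq, timeline = set(), 0, []
    for tick, seq in enumerate(arrivals):
        received.add(seq)
        while next_seq in received:
            _, stream, offset = by_seq[next_seq]
            timeline.append((tick, stream, offset))
            next_seq += 1
    return timeline


def deliver_independent_streams(arrivals):
    """QUIC-like: each stream is ordered on its own, so a gap in one
    stream does not stall the others."""
    by_seq = {packet[0]: packet for packet in SENT}
    received = defaultdict(set)
    next_offset = defaultdict(int)
    timeline = []
    for tick, seq in enumerate(arrivals):
        _, stream, offset = by_seq[seq]
        received[stream].add(offset)
        while next_offset[stream] in received[stream]:
            timeline.append((tick, stream, next_offset[stream]))
            next_offset[stream] += 1
    return timeline


def completion_tick(timeline, stream):
    """Tick at which the last chunk of `stream` reached the application."""
    return max(tick for tick, name, _ in timeline if name == stream)


if __name__ == "__main__":
    for label, deliver in [
        ("single ordered stream", deliver_single_ordered_stream),
        ("independent streams", deliver_independent_streams),
    ]:
        timeline = deliver(ARRIVAL_ORDER)
        print(label, "-> spotlight done at tick",
              completion_tick(timeline, "spotlight"))
```

In this model the Spotlight stream finishes one tick earlier with independent streams, because it never depended on the lost packet; the message stream, which did, finishes at the same time in both cases.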

Results

Snapchat's client network stack is built on top of Cronet, the open source mobile network library. Snap uses Cronet for its QUIC implementation and has also improved service observability through rich metrics and logs, building a unified view of client and server network performance.

According to Snapchat's comparison of different protocols across regions, enabling QUIC improved overall P90/P99 network latency by 6%-20% and reduced network errors by 3%-8%. The improvements are larger for users on poor network connections.
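
For readers unfamiliar with tail-latency metrics, the sketch below shows how a P90/P99 figure and a relative improvement such as those quoted above could be computed. It uses the nearest-rank percentile definition, which is one common choice and not necessarily the one Snapchat uses. The latency samples in the usage section are invented purely to exercise the functions and are not Snapchat data.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def improvement_pct(before, after):
    """Relative reduction, in percent, going from `before` to `after`."""
    return (before - after) / before * 100


if __name__ == "__main__":
    # Made-up request latencies in milliseconds, for illustration only.
    without_quic = [100, 120, 130, 150, 160, 180, 200, 250, 400, 900]
    with_quic = [95, 110, 120, 135, 150, 165, 180, 220, 340, 780]
    for p in (90, 99):
        before = percentile(without_quic, p)
        after = percentile(with_quic, p)
        print(f"P{p}: {before} ms -> {after} ms "
              f"({improvement_pct(before, after):.1f}% lower)")
```

Percentiles are used instead of averages because the slowest requests, which are the ones users notice, barely move the mean.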

Snapchat enabled QUIC on its ad service in October 2019. Significant improvements in P90/P99 latency and error rates were observed.

As shown above, error rates improved for all error codes, including connection timeout, connection loss, and request timeout. Breaking the latency improvements down further by country and region shows that places with relatively poor network quality and greater geographical distance from the service see larger latency improvements.

In the second example, enabling BBR congestion control on the client-to-server path on top of QUIC also produced significant latency improvements. The gains are greater for larger request payloads.

In the final example, by enabling connection migration on Android, the success rate of network requests when the Wi-Fi connection is lost is improved by 20%.
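
Why connection migration keeps requests alive can be shown with a minimal sketch of how a server might look up the session a packet belongs to. This is a simplified model, not Cronet or QUIC library code: both table classes, the `demo` function and the addresses (taken from documentation-only IP ranges) are hypothetical. The TCP-like table keys sessions by address and port, while the QUIC-like table keys them by a random 64-bit connection ID, as described earlier in the article.

```python
import secrets


class FourTupleTable:
    """TCP-like lookup: a connection is identified by
    (source IP, source port, destination IP, destination port)."""

    def __init__(self):
        self._sessions = {}

    def open(self, four_tuple, session):
        self._sessions[four_tuple] = session

    def lookup(self, four_tuple):
        return self._sessions.get(four_tuple)


class ConnectionIdTable:
    """QUIC-like lookup: packets carry a connection ID that is chosen
    independently of the client's address."""

    def __init__(self):
        self._sessions = {}

    def open(self, session):
        connection_id = secrets.token_bytes(8)  # 64 random bits
        self._sessions[connection_id] = session
        return connection_id

    def lookup(self, connection_id):
        return self._sessions.get(connection_id)


def demo():
    server = ("203.0.113.10", 443)
    wifi = ("192.0.2.7", 50000)          # client address on Wi-Fi
    cellular = ("198.51.100.23", 41000)  # client address after the switch

    tcp_like = FourTupleTable()
    tcp_like.open(wifi + server, "upload-in-progress")

    quic_like = ConnectionIdTable()
    connection_id = quic_like.open("upload-in-progress")

    # The phone leaves Wi-Fi: the source address changes, the ID does not.
    return tcp_like.lookup(cellular + server), quic_like.lookup(connection_id)


if __name__ == "__main__":
    print(demo())
```

A real QUIC stack does more than this lookup, for instance validating the new network path before trusting it, but the lookup key is the reason the session survives the address change.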

Summary

Snapchat's QUIC practice has achieved very good results. By adopting a new technology, it solved service pain points, improved performance, and greatly improved the user experience. This case of a small protocol solving big problems can be applied directly, or cited as supporting research when convincing leadership to improve an architecture.
