When infrastructure fails, how does Agora SD-RTN ensure high availability of RTE services?

Cloud computing has made enterprise management, business development, and resource integration far more convenient, and it has become one of the core infrastructures of digital transformation. However, local or large-scale outages are inevitable for cloud vendors, and even the world's leading computing platforms are no exception. For example, at 10:45 a.m. Eastern Time on December 7, 2021, Amazon AWS suffered an outage that affected the online services of sites such as Disney+ and Netflix and drew wide attention across the industry.

Cloud outages cannot be avoided 100% because they have many possible causes: human error, network interruptions or regional network congestion, power outages, natural disasters, and so on. What a cloud vendor can do is keep optimizing its technology and services to handle these problems and minimize the probability of downtime.

As the world's leading real-time interactive cloud service provider, Agora also uses AWS infrastructure for some of its overseas business. During the AWS outage, Agora's real-time audio and video services were not affected. The core reason is the architecture of Agora's SD-RTN™ network, which is designed to keep RTE (real-time engagement) services highly available even when data centers, hardware, networks, or other infrastructure fail.

First, what is high availability? A reliable cloud service must have very high availability, and the usual yardstick is the SLA (Service Level Agreement), the vendor's guarantee of service availability. Many domestic cloud vendors promise 99.9% availability when selling cloud services. The more 9s, the longer the service is available over the year and the more reliable it is. For example, over a 365-day year (8,760 hours), 99.9% availability allows at most 8.76 hours of unavailability. Every further improvement in availability is a technical challenge. When environmental disasters, unreliable public network infrastructure, or similar problems strike, every cloud vendor must honestly face three questions: how quickly can it respond, how long will recovery take, and does it have a mature backup plan?
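The downtime arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `annual_downtime_hours` is made up for this example, and it assumes the same 365-day year used in the text.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, matching the 365-day year used above


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the maximum yearly downtime (in hours) that an availability target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Each extra 9 cuts the allowed downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

At 99.9% this yields roughly the 8.76 hours quoted above, and each additional 9 divides the downtime budget by ten, which is why every step up in availability is harder than the last.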

Improving availability requires work at multiple levels, such as data center layout, service infrastructure, and operations automation. So how does Agora ensure the high availability of RTE services in practice? We can look at it from four levels:

1. SD-RTN™ Architecture Design: Real-time Fault Perception and Intelligent Scheduling, Multi-site Active-Active

Business architecture: Infrastructure can become unavailable for a period of time because of sudden network congestion, hardware failure, force majeure, and other factors. With this in mind, the SD-RTN™ architecture team accounted for infrastructure instability from the very beginning of the design. A few keywords describe SD-RTN™: global coverage, real-time fault perception and intelligent scheduling, ultra-low latency, elasticity, multi-site active-active, and ultra-high concurrency. When infrastructure fails, real-time fault perception and intelligent scheduling, together with the multi-site active-active deployment, play the key role in keeping services highly available.

(1) Real-time fault perception and intelligent scheduling: Public networks around the world fluctuate frequently. SD-RTN™'s network probing service perceives network quality in real time and, combined with the analysis capabilities of AI Ops (intelligent operations), migrates users within minutes to protect their audio and video experience.

(2) Multi-site active-active: SD-RTN™ divides global resources into multiple regions, each of which meets at least an N+3 resource redundancy requirement: even when its three largest resource clusters are unavailable, the remaining resources can still carry the region's current load. In addition, regions back each other up, so when a whole region fails, a complementary region can take over.

(3) Elastic scaling: Each region of SD-RTN™ can expand or contract capacity in real time by at least 200%, which lets it respond to emergencies and, together with intelligent scheduling, use resources fully and efficiently.
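The N+3 rule described above can be expressed as a simple capacity check. The sketch below is an illustration, not Agora's implementation: the function name `survives_n_plus_k`, the capacity unit, and the sample numbers are all assumptions made for this example.

```python
def survives_n_plus_k(cluster_capacities: list, region_load: float, k: int = 3) -> bool:
    """Return True if the region can still carry region_load after losing its k largest clusters.

    cluster_capacities and region_load use the same hypothetical unit,
    for example concurrent streams.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Worst case: the k largest clusters fail at the same time.
    remaining = sorted(cluster_capacities, reverse=True)[k:]
    return sum(remaining) >= region_load


if __name__ == "__main__":
    capacities = [40, 30, 30, 20, 20, 20, 10]  # made-up cluster sizes
    print(survives_n_plus_k(capacities, region_load=60))  # 70 units remain
    print(survives_n_plus_k(capacities, region_load=80))  # 70 units remain
```

Checking against the largest clusters rather than any three is what makes this a worst-case guarantee: if the region survives that loss, it survives the loss of any other three clusters as well.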

SDK: Agora has also done a great deal of optimization on the audio and video SDK side, including weak-network resilience and audio and video experience optimization. The SDK on the client side and the business layer on the network side thus work together to improve service availability.

2. Infrastructure level: Globally distributed data centers, resource coverage of three centers in five locations

Site selection for basic resources: SD-RTN™ has deployed more than 250 data centers around the world, covering more than 200 countries and regions. Each major region must have, at minimum, resource coverage of three centers in five locations, and each region uses a core node + POP (point of presence) layout. This way, if one or two data centers in a region fail, all traffic from the affected city can be switched to a healthy data center.
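As a rough illustration of the traffic switch described above, the sketch below picks a healthy data center outside the affected city. The source does not describe how SD-RTN™ actually selects a target, so the `DataCenter` fields, the function `pick_failover_target`, and the "most spare capacity" rule are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataCenter:
    name: str
    city: str
    healthy: bool
    spare_capacity: int  # hypothetical unit: additional streams it can accept


def pick_failover_target(failed_city: str, region: List[DataCenter]) -> Optional[DataCenter]:
    """Pick the healthy data center outside failed_city with the most spare capacity.

    Returns None when the region has no healthy data center left, in which
    case a complementary region would have to take over.
    """
    candidates = [dc for dc in region if dc.healthy and dc.city != failed_city]
    return max(candidates, key=lambda dc: dc.spare_capacity, default=None)


if __name__ == "__main__":
    region = [
        DataCenter("dc-a", "City A", True, 500),
        DataCenter("dc-b", "City B", True, 800),
        DataCenter("dc-c", "City B", False, 900),
        DataCenter("dc-d", "City C", True, 300),
    ]
    target = pick_failover_target("City B", region)
    print(target.name if target else "no healthy data center in region")
```

A production scheduler would also weigh latency to the affected users and the cost of each link. Spare capacity alone is used here only to make the selection step concrete.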

Supply chain management: Basic resources (data centers, hardware, networks, and so on) never depend on a single supplier. When one supplier has a problem, services can be switched quickly to other suppliers that are operating normally.

3. Intelligent operations to contain faults quickly

There is a consensus in the industry that operations complexity is growing rapidly while traditional operations is already stretched to its limit. Agora has therefore invested heavily in people and resources to overcome the difficulties of putting AI engineering into production, and has applied intelligent operations throughout the daily operation of SD-RTN™. This addresses the pain points of traditional operations by providing uninterrupted 24/7 coverage, consistent and high-quality execution, and unified, efficient operations.

Agora's AI Ops (intelligent operations) can identify a data center anomaly and remediate it automatically within 1 minute (the end-to-end time covering data aggregation, reporting, judgment, execution, and recovery), quickly stopping the fault from spreading and keeping edge services highly available. Take network congestion at edge nodes, which is inevitable: once congestion occurs, the user's audio and video experience degrades (stuttering, increased latency). During the day, experienced operations staff need an average of 20 minutes from discovering such a fault to handling it. If the fault occurs late at night or is not handled in time, it takes longer, with a serious impact on user experience. This is where AI Ops shows its value: it identifies and handles the anomaly within 2.5 minutes, and it does so 24/7 with high consistency, protecting users' high-quality RTC experience.
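The source does not say how AI Ops decides that an edge node is congested. As a minimal sketch of automated detection, assume a node is flagged once several consecutive packet-loss samples exceed a threshold. The class name `CongestionDetector`, the 5% threshold, and the three-sample window are hypothetical values chosen for illustration only.

```python
from collections import deque


class CongestionDetector:
    """Flags a node once `window` consecutive samples exceed the loss threshold."""

    def __init__(self, loss_threshold: float = 0.05, window: int = 3) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.loss_threshold = loss_threshold
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def observe(self, packet_loss: float) -> bool:
        """Record one packet-loss sample; return True when the node should be drained."""
        self.samples.append(packet_loss)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.loss_threshold for s in self.samples)


if __name__ == "__main__":
    detector = CongestionDetector(loss_threshold=0.05, window=3)
    for loss in (0.01, 0.08, 0.09, 0.10, 0.02):
        if detector.observe(loss):
            print(f"loss={loss:.2f}: congested, migrate users to a healthy node")
        else:
            print(f"loss={loss:.2f}: ok")
```

Requiring several consecutive bad samples trades a little detection time for fewer false alarms from momentary spikes. Because the check runs on every sample without human involvement, it behaves the same at 3 a.m. as at noon, which is the consistency argument made above.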

4. XLA, the first quality of experience standard in the RTE industry

As mentioned earlier, the SLA is the availability yardstick for many cloud vendors and for the telecommunications industry. In Agora's view, however, an SLA governs equipment and network access standards and focuses on whether a service is available. In the RTE industry, merely meeting the "availability" standard is far from enough. Users want clear, smooth, and uninterrupted audio and video interaction, so the quality of the real-time interactive experience must meet the higher "easy to use" standard. In response, in July 2020 Agora designed, defined, and launched XLA (Experience Level Agreement), the first experience quality standard in the real-time interaction industry. It is also the first standard for the availability and experience quality of RTE services that is quantifiable, verifiable, and backed by compensation.

Unlike an SLA, XLA covers not only the availability and service quality of real-time interaction but also the quality of the user experience. It is also the first standard to shift the focus of quality assurance from devices to people. XLA consists of four experience indicators: the 5s login success rate, the 600ms video freeze rate, the 200ms audio freeze rate, and the 400ms network delay compliance rate. The monthly compliance rate of the four indicators, calculated as 1 - (total duration of non-compliant slices / total monthly duration), must exceed 99.5%. The 5s login success rate counts a login as qualified only if it completes in less than 5 seconds; it measures the availability and waiting experience of real-time interaction. The 600ms video freeze rate and 200ms audio freeze rate measure smoothness during real-time interaction. The 400ms network delay indicator targets the real-time nature of audio and video interaction: the delay must be less than 400ms.
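The monthly compliance formula above can be written out directly. The sketch below is illustrative: the function names and the 30-day month in the usage example are assumptions, while the formula and the 99.5% threshold come from the text.

```python
XLA_TARGET = 0.995  # the monthly compliance rate must exceed 99.5%


def monthly_compliance_rate(non_compliant_seconds: float, total_seconds: float) -> float:
    """Compute 1 - (total duration of non-compliant slices / total monthly duration)."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= non_compliant_seconds <= total_seconds:
        raise ValueError("non_compliant_seconds must be between 0 and total_seconds")
    return 1 - non_compliant_seconds / total_seconds


def meets_xla(rate: float) -> bool:
    """An indicator passes only if its rate strictly exceeds the target."""
    return rate > XLA_TARGET


if __name__ == "__main__":
    month = 30 * 24 * 3600  # assumed 30-day month, in seconds
    for bad_hours in (1, 5):
        rate = monthly_compliance_rate(bad_hours * 3600, month)
        print(f"{bad_hours} h non-compliant -> {rate:.4%}, passes: {meets_xla(rate)}")
```

For a 30-day month (720 hours), the 99.5% bar leaves a budget of 3.6 hours of non-compliant slices per indicator, so one bad hour passes while five do not.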

Through XLA, customers receive Agora's commitment and guarantee on the quality of the real-time interactive experience across multiple dimensions, such as login success rate, end-to-end latency, and audio and video freeze rates. They no longer need to worry about the experience quality of their end users and can use the service with confidence.

The quality standards for the real-time interactive experience may look like just a few indicators, but they embody the long-term effort of the Agora team. The XLA standard is the result of repeated polishing, improvement, and verification against full-link data by hundreds of technical experts. It went through 10 iterations, was adapted to 50+ network models, was optimized for 200+ countries and regions and for the experience on 6,000+ different terminal types, and was refined against 1 trillion minutes of full-link data. It also reflects Agora's long-term investment and accumulated experience in the real-time interactive cloud industry.
