Cloud computing has brought great convenience to enterprise management, business development, and resource integration, and it is one of the core infrastructures of digitalization. However, local or large-scale outages are inevitable for cloud vendors, and the world's leading computing platforms are no exception. For example, at 10:45 a.m. Eastern Time on December 7, Amazon AWS suffered an outage that affected the online services of websites such as Disney+ and Netflix, and the failure drew wide attention across the industry.

Cloud vendors cannot avoid downtime 100% because it has many possible causes: human error, network interruption or regional network congestion, power outages, natural disasters, and so on. As a cloud vendor, what we can do is continuously optimize technology and services to deal with these problems and minimize the probability of downtime.

As the world's leading real-time interactive cloud service provider, Agora also uses AWS infrastructure resources for some of its overseas businesses. During the AWS outage, Agora's real-time audio and video services were not affected. The core reason is the architectural design of Agora's SD-RTN™ network, which ensures the high availability of RTE (real-time interactive) services: even if data centers, hardware, networks, or other infrastructure fail, it can still provide users with highly available RTE services.

First, we need to understand what high availability is. A reliable cloud service must have very high availability. The standard measure is the SLA (Service Level Agreement), a cloud vendor's guarantee of service availability. Many domestic cloud vendors promise 99.9% availability when selling cloud services. The more 9s, the longer the service is available over the year and the more reliable it is; the fewer 9s, the less reliable.
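
The link between the number of 9s and yearly downtime is simple arithmetic. A minimal sketch, assuming a 365-day year (the function name and the sample targets are illustrative, not part of any SLA definition):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year, no leap day


def yearly_downtime_hours(availability_percent: float) -> float:
    """Maximum hours of unavailability per year allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Each extra 9 cuts the allowed downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {yearly_downtime_hours(target):.2f} hours/year")
```
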
For example, over a 365-day year, 99.9% availability means the service may be unavailable for only 8.76 hours. Every improvement in availability is a technical challenge. When environmental disasters, unreliable public network infrastructure, or other problems strike, every cloud vendor must honestly face three questions: how quickly the problem can be handled, how long recovery will take, and whether a mature backup plan exists. Improving availability requires work at multiple levels, such as data center layout, service infrastructure, and operation and maintenance automation.

So how does Agora ensure the high availability of RTE services in practice? We can look at it on four levels.

1. SD-RTN™ architecture design: real-time fault perception and intelligent scheduling, multi-site active-active

Business architecture: It is well known that infrastructure may become unavailable for a period of time because of sudden network congestion, hardware failure, force majeure, and other factors. With this in mind, the Agora SD-RTN™ architecture team has accounted for infrastructure instability from the very beginning of the design. A few keywords describe SD-RTN™: global coverage, real-time fault perception and intelligent scheduling, ultra-low latency, elasticity, multi-site active-active, and ultra-high concurrency. When infrastructure fails, SD-RTN™'s real-time fault perception and intelligent scheduling, together with its multi-site active-active construction, play an important role in keeping services highly available.

1. Real-time fault perception and intelligent scheduling: Public networks around the world fluctuate frequently. SD-RTN™'s network sniffing service perceives network quality in real time.
Combined with the analysis capabilities of AI Ops (intelligent operation and maintenance), it can migrate users within minutes and protect their audio and video experience.

2. Multi-site active-active: The SD-RTN™ network divides global resources into multiple regions, and each region meets at least an N+3 resource redundancy requirement (that is, when the three largest resource clusters are unavailable, the remaining resources can still carry the region's current load). In addition, regions complement one another: when a region fails, its complementary region can take over.

3. Elastic scaling: Each region of the SD-RTN™ network can elastically scale in real time by at least 200%, so it can respond to emergencies and, together with intelligent scheduling, use resources fully and reasonably.

SDK: Agora has also done a great deal of optimization on the audio and video SDK side, including resistance to weak networks and improvements to the audio and video experience. The SDK and the business layer thus cooperate "inside and outside" to improve service availability.

2. Infrastructure level: globally distributed data centers, resource coverage in five locations and three centers

Basic resource site selection: SD-RTN™ has deployed more than 250 data centers around the world, covering more than 200 countries and regions. Major regions require, at minimum, resource coverage of three centers in five locations, and each region uses a core node + POP approach. In this way, if one or two data centers in a region fail, all traffic in the affected city can be switched to a healthy data center.

Supply chain management: Basic resources (including data centers, hardware, and networks) do not rely on a single supplier.
When a supplier has a problem, Agora can quickly switch to other suppliers whose services are normal.

3. Intelligent operation and maintenance to quickly block faults

There is a consensus in the industry today that operation and maintenance is growing rapidly more complex while traditional operation and maintenance is already stretched to the limit. Agora has therefore invested substantial resources and manpower to overcome the difficulties of AI engineering, and has fully applied intelligent operation and maintenance to the daily operation of SD-RTN™. This addresses the pain points of traditional operation and maintenance by providing 24/7 uninterrupted coverage, highly consistent and high-quality execution, and uniform, efficient operation.

Agora's AI Ops (intelligent operation and maintenance) can identify a data center anomaly and automatically handle it within 1 minute (the end-to-end time including data aggregation, reporting, judgment, execution, and recovery), quickly blocking the spread of the fault and ensuring the high availability of edge services.

For example, network congestion at edge nodes is inevitable. Once congestion occurs, the user's audio and video experience degrades (stuttering, increased latency). Experienced operation and maintenance personnel take an average of 20 minutes from discovering such a fault to handling it during the daytime. If the fault occurs late at night or is not handled in time, it takes longer, which greatly affects the user experience. This is where AI Ops shows its value: it can identify and handle anomalies within 2.5 minutes, and it runs 24/7 with high consistency to ensure a high-quality RTC experience for users.

4. XLA, the first quality of experience standard in the RTE industry

As mentioned earlier, the SLA is the availability criterion for many cloud vendors and for the telecommunications industry. In Agora's view, however, the SLA regulates equipment and network access standards and focuses on service availability. In the RTE industry, simply meeting the "availability" standard is far from enough. Users want clear, smooth, and uninterrupted audio and video interactions, so the quality of the real-time interactive experience must meet an "easy to use" standard.

In response, in July 2020 Agora designed, defined, and launched XLA (Experience Level Agreement), the first experience quality standard in the real-time interactive industry, and the first quantifiable, verifiable, and compensable standard covering both the availability and the experience quality of RTE services. Unlike the SLA, XLA cares not only about the availability and service quality of real-time interaction but also about the quality of the user experience. It is also the first standard to shift the focus of quality assurance from devices to people.

XLA comprises four experience indicators: 5s login success rate, 600ms video freeze rate, 200ms audio freeze rate, and 400ms network delay compliance rate. The monthly compliance rate of each indicator, calculated as 1 - (total duration of non-compliant slices / total monthly duration), must exceed 99.5%.

The 5s login success rate requires a login to succeed in less than 5 seconds to qualify; it mainly tests the availability and waiting experience of real-time interaction. The 600ms video freeze rate and 200ms audio freeze rate mainly test smoothness during real-time interaction. The 400ms network delay indicator targets the real-time nature of audio and video interaction: the delay must be less than 400ms.
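
The monthly compliance formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the indicator keys, and the idea of passing slice durations in seconds are hypothetical and do not describe Agora's implementation; only the formula and the 99.5% threshold come from the text.

```python
from typing import Iterable, Mapping

XLA_TARGET = 0.995  # from the text: each indicator's monthly rate must exceed 99.5%


def monthly_compliance_rate(non_compliant_slices: Iterable[float],
                            total_duration: float) -> float:
    """Return 1 - (total duration of non-compliant slices / total monthly duration).

    Both arguments use the same time unit (seconds here, as an assumption).
    """
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    bad = sum(non_compliant_slices)
    if bad < 0 or bad > total_duration:
        raise ValueError("non-compliant time must lie within the monthly total")
    return 1.0 - bad / total_duration


def meets_xla(rates: Mapping[str, float]) -> bool:
    """True only if every indicator's monthly compliance rate exceeds the target."""
    return all(rate > XLA_TARGET for rate in rates.values())


if __name__ == "__main__":
    month_seconds = 30 * 24 * 3600  # assumption: a 30-day month
    # Hypothetical indicator keys and sample non-compliant slice durations.
    rates = {
        "login_5s": monthly_compliance_rate([600.0], month_seconds),
        "video_freeze_600ms": monthly_compliance_rate([3600.0, 1800.0], month_seconds),
        "audio_freeze_200ms": monthly_compliance_rate([900.0], month_seconds),
        "network_delay_400ms": monthly_compliance_rate([], month_seconds),
    }
    print(rates, meets_xla(rates))
```

Under these assumptions, a 30-day month leaves each indicator a budget of 0.5% of 2,592,000 seconds, or 12,960 seconds (3.6 hours) of non-compliant time.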
Through XLA, customers obtain Agora's commitment and guarantee on the quality of the real-time interactive experience across multiple dimensions, such as login success rate, end-to-end latency, and audio and video freeze rate. They no longer need to worry about the quality of experience of end users and can use the service with confidence.

Defining quality standards for the real-time interactive experience may look like just a few indicators, but they carry the long-term efforts of the Agora team. XLA is the result of repeated polishing, improvement, and verification on full-link data by hundreds of technical experts. It has gone through 10 iterations, adapted to 50+ network models, been optimized for 200+ countries and regions and for 6,000+ types of terminals, and been refined against 1 trillion minutes of full-link data. It represents Agora's long-term cultivation and accumulation in the real-time interactive cloud industry.