Combining AI and big data, the intelligent operation and maintenance platform helps Liulishuo improve its core competitiveness

High-quality content and customized services enhance the core competitiveness of enterprises

In 2020, under the impact of the epidemic and the slogan "classes suspended, but learning continues", the online education market grew rapidly, reaching 485.8 billion yuan. After several years of rapid development, the market has become relatively mature, and users now place different demands on different types of online education institutions, so traffic alone can no longer win loyal users. For the education industry, the core competitiveness remains high-quality content and services. Only with high-quality course content, personalized plans based on each customer's learning habits and foundation, a high-quality and stable product experience, and more efficient business operations can enterprises achieve long-term development. Across the online education industry, as the market keeps adjusting, the companies that survive will be those that return to the essence of education and win with high-quality products, content and services.

Distinctive teaching powered by artificial intelligence

After further adjustments in the industry, companies in the online education sector will gradually shift their focus from user growth to content construction. Across the industry, however, syllabi are largely the same, and although courses and teaching methods vary, few of them are remarkable, so most companies cannot rely on content alone to stand out.
Liulishuo is different. It relies on its own intelligent teaching courses and on innovative technologies such as artificial intelligence (AI) to provide users with personalized courses and help more users improve their English. As of March 31, 2021, Liulishuo had accumulated more than 200 million registered users, and its very large "Chinese English Voice Database" allows each student to be evaluated according to their actual level. While learning pronunciation in Liulishuo, students can use the intelligent lip shape recognition and correction system, which dynamically captures key points of the student's mouth and compares them to find problems with the student's pronunciation. Targeted guidance can then be given to solve problems in oral expression, helping students improve their spoken English at the root.

Product experience is the key, but improving system stability is a challenge

With the rapid development of Liulishuo's business, the number of users has grown from an initial several million to more than 200 million. The swings in traffic between peak and off-peak periods, the complexity of the business and the difficulty of analysis pose major challenges to operation and maintenance. On the Internet, user experience is one of the most critical competitive advantages. According to statistics, every 1 second of delay leads to an average user loss of 7%.

Liulishuo has no separate operation and maintenance department, so the operation and maintenance system of its basic platform is developed mainly by the cloud-infra team. The team's core demands go beyond SLA, performance monitoring, alerting and providing data for locating problems. They also include operating the technical value of cloud-infra itself, such as utilization, cost savings and business relationship networks.

Given these core demands, the requirements for the intelligent operation and maintenance platform are:
1. Collect and monitor heterogeneous data sources, including machine metrics and utilization on K8s and ECS, Istio call logs, metrics from self-built middleware, metrics provided by cloud services, and business trace data. Cost data must also be collected in real time.
2. Dynamically discover and collect resources. Department-related data such as organizational relationships must also be updated in real time so that metrics and their attribution stay accurate.
3. Store and analyze data at large scale. Because of the scale of Liulishuo's business, the cloud resources it uses and the data its business generates amount to tens of TB per day. The solution must support real-time analysis and display at this scale.
4. Be stable itself. The monitoring platform is responsible for stability issues, so every part of it must be free of single points of failure and able to recover quickly from abnormal conditions.

One-stop intelligent operation and maintenance solution, connecting the entire chain from data collection to computing

The intelligent operation and maintenance platform built by Liulishuo has to process time-series data, and it also has to calculate core business availability data from various logs. It therefore needs a solution for each of two data types, Logs and Metrics. Community and commercial options exist for both, such as ES, Loki, SLS, Prometheus, OpenTSDB and InfluxDB. In the end, Alibaba Cloud SLS was selected as the log solution, and Prometheus+SLS as the time series solution. The main reasons are as follows:
1. SLS can store and analyze various types of data in one place, and metrics and logs stored on SLS can be correlated with each other, which the other platforms considered do not offer.
2. The SLS platform can handle very large data volumes, performs much better than ES, and is a maintenance-free service, so there is no need to keep an ES cluster highly reliable yourself.
3. The time series solution is based mainly on Prometheus. The Prometheus ecosystem is very complete, and PromQL is simple to use. The SLS time series store can serve as highly reliable remote storage for Prometheus, which addresses the reliability problem of Prometheus itself.
4. SLS has a data processing function that can join logs with external data sources, so it can better handle complex logs and add catalog-related information to them.

To automate as much as possible, Alibaba Cloud Log Service (SLS) also provides a mechanism for dynamically discovering IaaS and PaaS resources in cloud scenarios. Newly purchased or created resources are added to monitoring and collection in real time, which avoids most manual operations.
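The idea behind dynamic discovery can be illustrated with a small reconciliation step. This is only a sketch under assumptions, not how SLS implements it: the `Resource` shape, the resource IDs and the `reconcile` function are all hypothetical, and a real system would obtain the inventory from cloud APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A cloud resource identified by type and ID (hypothetical shape)."""
    kind: str          # for example "ecs" or "rds"
    resource_id: str


def reconcile(discovered: set, monitored: set):
    """Return (to_add, to_remove) so that monitoring matches the inventory."""
    to_add = discovered - monitored      # new resources not yet collected
    to_remove = monitored - discovered   # resources that no longer exist
    return to_add, to_remove


# Hypothetical inventory snapshot and current monitoring targets.
discovered = {
    Resource("ecs", "i-001"),
    Resource("ecs", "i-002"),
    Resource("rds", "rm-001"),
}
monitored = {Resource("ecs", "i-001"), Resource("ecs", "i-000")}

to_add, to_remove = reconcile(discovered, monitored)
```

Running the reconciliation on a schedule, or on resource change events, keeps the set of monitored targets aligned with what actually exists without manual registration.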

For each data scenario, the use of Alibaba Cloud Log Service (SLS) was also tailored to Liulishuo's needs:

1. Logs

Logs from different businesses are collected directly into separate log repositories through Logtail, the SLS collection agent. Not all logs need to be stored and indexed for a long time, so we classify them. Logs that require auditing are delivered to OSS for long-term storage. Logs used for business troubleshooting are kept for only 2 weeks, with full-text indexing enabled. For access logs, only some fields are indexed, which saves a lot of indexing cost.
For NGINX access logs that are used to calculate SLA and PXX metrics, data processing maps the URLs in the logs to the corresponding department, application, method and so on, using mapping rules together with the department, application and other catalog information stored in RDS.
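The URL-to-catalog mapping described above can be sketched as a longest-prefix lookup. The rule table, the department and application names, and the `lookup` and `enrich` functions are hypothetical; in the article the rules live in RDS and the join is performed by the SLS data processing function, not by application code like this.

```python
from typing import NamedTuple, Optional


class CatalogEntry(NamedTuple):
    department: str
    application: str


# Hypothetical mapping rules (URL prefix -> catalog entry).
RULES = [
    ("/api/course/", CatalogEntry("learning", "course-service")),
    ("/api/pay/", CatalogEntry("commerce", "payment-service")),
    ("/api/", CatalogEntry("platform", "gateway")),
]


def lookup(path: str) -> Optional[CatalogEntry]:
    """Return the entry for the longest matching URL prefix, if any."""
    best = None
    best_len = -1
    for prefix, entry in RULES:
        if path.startswith(prefix) and len(prefix) > best_len:
            best, best_len = entry, len(prefix)
    return best


def enrich(record: dict) -> dict:
    """Return a copy of an access-log record with catalog fields attached."""
    entry = lookup(record["path"])
    enriched = dict(record)
    enriched["department"] = entry.department if entry else "unknown"
    enriched["application"] = entry.application if entry else "unknown"
    return enriched
```

Once each record carries a department and an application, SLA and latency figures can be grouped by either field without touching the business code.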

2. Data monitoring

Prometheus was chosen as the monitoring solution. For Liulishuo's scenario, we developed a number of Exporters to obtain metrics from various cloud products and self-built components.
To make Prometheus easier to use and to integrate it with the internal CI/CD system, we added a Sidecar to Prometheus that watches the Git repository for changes and dynamically reloads the Prometheus configuration when it changes.
To speed up queries, various Recording Rules are configured on Prometheus, all of which are managed in Git.
AlertManager's alerts are sent directly to the internal alert center, which provides advanced functions such as formatting and escalation. To remove Prometheus as a single point of failure and to enable later correlation analysis with the Catalog, Prometheus writes directly to the SLS time series store through Remote Write.
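The Sidecar's reload logic can be sketched as follows. This is a minimal illustration under assumptions rather than Liulishuo's implementation: the function names are hypothetical, pulling the repository is left out, and the reload call relies on the Prometheus `/-/reload` endpoint, which only works when Prometheus is started with `--web.enable-lifecycle`.

```python
import subprocess
import urllib.request
from typing import Callable, Optional


def git_head(repo_dir: str) -> str:
    """Return the current commit hash of the checked-out config repository."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def reload_prometheus(base_url: str) -> None:
    """Ask Prometheus to reload its configuration via the lifecycle endpoint."""
    req = urllib.request.Request(f"{base_url}/-/reload", method="POST")
    with urllib.request.urlopen(req, timeout=10):
        pass


def sync_once(last_seen: Optional[str],
              get_head: Callable[[], str],
              reload: Callable[[], None]) -> str:
    """Reload only when the commit hash changed; return the hash now seen."""
    head = get_head()
    if head != last_seen:
        reload()
    return head
```

A sidecar loop would call `sync_once` periodically, passing `lambda: git_head(repo_dir)` and `lambda: reload_prometheus(base_url)`, after pulling the repository. Keeping the Git and HTTP calls behind injected callables makes the change-detection logic easy to test without a running Prometheus.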

3. Metric calculation

Some of the core metrics are calculated from NGINX access logs. At the entry point, the QPS, error rate and latency (average, PXX, etc.) of each business can be obtained without any intrusion into the business itself. Metrics for resource utilization, middleware and infrastructure come from the time series store that Prometheus writes to. Using the Catalog, the metrics of each department and business can then be aggregated. Because the calculated results are very small in volume, they can easily be stored in MySQL and ES, with a copy sent to OSS for backup.
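As a rough illustration of the entry-point calculation, the sketch below aggregates QPS, error rate and latency per application from access-log records. The record fields, the choice of counting 5xx responses as errors and the nearest-rank percentile method are assumptions made for the example; the article does not state how its SLA or PXX values are defined, and in production this aggregation runs as queries on SLS rather than in Python.

```python
import math
from collections import defaultdict


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list, window_seconds: float) -> dict:
    """Aggregate QPS, error rate and latency per application over one window."""
    by_app = defaultdict(list)
    for rec in records:
        by_app[rec["application"]].append(rec)

    summary = {}
    for app, recs in by_app.items():
        latencies = [r["latency_ms"] for r in recs]
        # Assumption: only 5xx responses count as errors.
        errors = sum(1 for r in recs if r["status"] >= 500)
        summary[app] = {
            "qps": len(recs) / window_seconds,
            "error_rate": errors / len(recs),
            "avg_ms": sum(latencies) / len(latencies),
            "p99_ms": percentile(latencies, 99),
        }
    return summary


# Hypothetical records already enriched with an application name.
records = [
    {"application": "course-service", "status": 200, "latency_ms": 10.0},
    {"application": "course-service", "status": 200, "latency_ms": 20.0},
    {"application": "course-service", "status": 500, "latency_ms": 30.0},
    {"application": "course-service", "status": 200, "latency_ms": 40.0},
]
result = summarize(records, window_seconds=2.0)
```

Grouping by department instead of application only requires changing the grouping key, which is why attaching catalog information to each record beforehand matters.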

Building a unified intelligent operation and maintenance platform that turns operations from a cost center into a productivity tool for innovation

This intelligent operation and maintenance platform now carries almost all of the company's core operation and maintenance work. It has run stably since launch and easily copes with sudden increases in data volume during promotional activities. Its business value is reflected mainly in the following areas:

Monitoring: The first value is monitoring and alerting of all kinds, especially for SLA. Because the data is already associated with specific departments and business applications, the SLA of each department and application is easy to obtain and can be used to drive improvement.
Company-wide unified troubleshooting and fault isolation: Based on Istio's access logs combined with Catalog information, the call relationships between applications can be calculated, so a business relationship graph can be generated in real time, along with the quality of each relationship (edge). With the business relationships understood, the root cause of a problem can be located quickly and the fault isolated.
FinOps: The issue on which the Cloud Infra department is challenged most is cost, so cost optimization is one of our core tasks. The main approach is to calculate the resource utilization of each department and team, including average utilization and utilization at various PXX levels (as shown in the table below), in order to determine each department's resource usage and promote cost optimization in each department.
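The business relationship graph mentioned under troubleshooting above can be sketched as a per-edge aggregation over call records. The record fields, the service names and the use of 5xx error rate as the measure of edge quality are assumptions for illustration; the article does not specify how edge quality is defined, and the real input would be Istio access logs joined with Catalog data.

```python
from collections import defaultdict


def build_call_graph(calls: list) -> dict:
    """Aggregate per-edge request counts and error rates from call records."""
    counts = defaultdict(lambda: {"requests": 0, "errors": 0})
    for call in calls:
        edge = (call["source"], call["destination"])
        counts[edge]["requests"] += 1
        # Assumption: a 5xx status marks a failed call on this edge.
        if call["status"] >= 500:
            counts[edge]["errors"] += 1
    return {
        edge: {
            "requests": c["requests"],
            "error_rate": c["errors"] / c["requests"],
        }
        for edge, c in counts.items()
    }


def worst_edges(graph: dict, threshold: float) -> list:
    """Return edges whose error rate exceeds the threshold, worst first."""
    bad = [(edge, stats["error_rate"]) for edge, stats in graph.items()
           if stats["error_rate"] > threshold]
    return sorted(bad, key=lambda item: item[1], reverse=True)


# Hypothetical call records between applications.
calls = [
    {"source": "gateway", "destination": "course-service", "status": 200},
    {"source": "gateway", "destination": "course-service", "status": 503},
    {"source": "course-service", "destination": "user-service", "status": 200},
]
graph = build_call_graph(calls)
suspects = worst_edges(graph, threshold=0.1)
```

Ranking edges by error rate gives a starting point for root-cause analysis: an unhealthy edge narrows the search to one caller and one callee.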

Closing thoughts

In the cloud-native era, digitalization is driving business innovation in every industry. Only by improving user experience, accelerating innovation, updating infrastructure and architecture, and making good use of diverse data can a company stand out. The intelligent operation and maintenance platform launched by Alibaba Cloud is meant not only to reduce engineers' workload, but also to free operation and maintenance engineers from repetitive, mechanical work. We take care of the "dirty and tiring work" and greatly shorten the duration of failures, so that operation and maintenance staff can devote more of their creativity to digital and business innovation and make their enterprises more competitive.
