[51CTO.com original article] On May 18-19, 2018, the Global Software and Operation Technology Summit hosted by 51CTO was held in Beijing. Technical experts from companies around the world gathered to discuss the forefront of software technology and explore new frontiers in operations technology. Alongside the main forum, the conference featured 12 sub-forums. At the "AIOps under Containers" sub-forum, Cheng Chao, head of monitoring at Alibaba Group, gave a talk on Alibaba's path in monitoring from automation to intelligence.
Cheng Chao joined Alibaba in 2008, and the first project he took over was the CMDB. That generation of the CMDB has been in operation for nearly ten years and has outlasted many iterations of the operations platform. Cheng Chao joined Alibaba as a developer, and in recent years he has mainly worked on the development and operation of the monitoring platform.

Review of Alibaba Monitoring System

According to Cheng Chao, Alibaba's initial monitoring system was also open source. The biggest problem with open source monitoring systems, he said, is that they do not scale: once the scale grows, various problems arise. In 2009, his team abandoned the open source system and built a monitoring system of their own. The first generation of the self-developed system supported Alibaba's growth for about five years, and parts of it are still in use today. Its biggest contribution was solving the problem of scale, because it introduced the concept of a domain.

The monitoring platform Alibaba uses today is also its most important generation, and it differs from its predecessors in many ways. In the past, Alibaba used HBase for storage, but it is now moving to HiTSDB. Unlike conventional monitoring systems, which are built bottom-up, Alibaba's self-developed architecture is top-down.

Cheng Chao said that the current system serves more than 90 internal tenants, including Taobao, Hema, Youku and other Alibaba businesses. The monitoring system comprises more than 4,000 virtual machines, the figure from last year's Double Eleven.

Alibaba's current monitoring system

After briefly reviewing the past generations of monitoring systems, Cheng Chao turned to Alibaba's current monitoring system.
He believes that several important things have been done.

First, the team implemented zero-copy. Cheng Chao's principle for designing a monitoring system is that processing should happen centrally rather than on the monitored machines themselves. Doing monitoring work on the machine easily causes problems: CPU jitter, for example, affects the monitoring results. This has actually happened, so the system trades bandwidth for CPU and sends data without any processing, not even compression.

Second, the team borrowed from Akka to build its own framework. The design of the framework is relatively advanced, though it has also gone through continuous debugging and improvement to meet today's needs.

Cheng Chao emphasized the Agent, on which Alibaba has done a great deal of work. Because the monitoring system was built after the business systems, the Agent had to connect to all kinds of existing systems, and there were no monitoring conventions that everyone followed. A date, for example, can be written in many ways: the Agent is currently compatible with seven date formats, and more exist that are less commonly used. Directory paths are likewise written in various ways. Cheng Chao stressed that the Agent must adapt to the business, because the core value of the entire monitoring system is to ensure business stability.

Why does Alibaba focus on business?

Cheng Chao said that HiTSDB, mentioned earlier, is not yet complete and is still under development. Alibaba has implemented its own MQL, but MQL cannot deliver its value on top of HBase. At the same time, Alibaba has a strong HBase development and operations team, and HBase has run very stably in recent years without problems. So why does Alibaba want to switch to HiTSDB?
Cheng Chao explained that there are things HBase cannot do, such as flexibly combining various dimensions, which is why Alibaba is switching to HiTSDB. HiTSDB is a database that Alibaba implemented based on the OpenTSDB specification. Adapting it to the monitoring of large-scale systems has taken considerable effort, and HiTSDB is still being optimized. The switch to HiTSDB is expected to be completed before this year's Double Eleven.

With a diagram in his slides, Cheng Chao showed the entire monitoring platform. The developers' original intention was to unify the technical components under the monitoring platform. Today there are many monitoring systems within Alibaba; the one Cheng Chao's team runs is the largest in scale and the most valuable in its vertical field. In the diagram, he marked in red the computing framework, the part on which the team has spent the most energy. The computing framework accounts for a very large share of the whole architecture and covers disaster recovery, performance and many other aspects. Building it required a large investment of manpower and resources.

He also said that Alibaba's monitoring system has made some progress in computing and alarm notification. Alarms and notifications are two things that almost every monitoring system has to deal with, and they matter more as the scale grows. At first Alibaba had only one monitoring system, and Cheng Chao's team moved forward by trial and error; things they initially thought were of little value became more meaningful after the monitoring system was upgraded. The alarm and notification systems are also independent, and they are crucial to the monitoring system. The biggest difference between one monitoring system and another is the field it targets.
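One thing an independent notification layer makes possible is suppressing repeated alarms before they reach people, whichever monitoring system raised them. The following is a minimal Python sketch of that idea, not Alibaba's implementation: the `Alarm` and `NotificationLayer` names, the alarm key, and the fixed suppression window are all assumptions made for illustration, and the channels are plain callables standing in for real message senders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alarm:
    source: str       # hypothetical: which monitoring system raised the alarm
    key: str          # hypothetical: what the alarm is about, e.g. "app1/cpu"
    message: str
    timestamp: float  # seconds


@dataclass
class NotificationLayer:
    """Sits between monitoring systems and delivery channels (illustrative only)."""
    channels: List[Callable[[Alarm], None]]
    window_seconds: float = 300.0  # assumed suppression window
    _last_sent: Dict[str, float] = field(default_factory=dict)
    suppressed: int = 0

    def submit(self, alarm: Alarm) -> bool:
        """Deliver the alarm unless the same key was delivered within the window."""
        last = self._last_sent.get(alarm.key)
        if last is not None and alarm.timestamp - last < self.window_seconds:
            self.suppressed += 1
            return False
        self._last_sent[alarm.key] = alarm.timestamp
        for send in self.channels:
            send(alarm)
        return True


# Usage: a list's append method stands in for a real channel.
sent: List[Alarm] = []
layer = NotificationLayer(channels=[sent.append], window_seconds=60.0)
layer.submit(Alarm("monitor-a", "app1/cpu", "CPU high", 0.0))   # delivered
layer.submit(Alarm("monitor-b", "app1/cpu", "CPU high", 10.0))  # suppressed
layer.submit(Alarm("monitor-a", "app1/cpu", "CPU high", 70.0))  # delivered
```

Keying suppression on what the alarm is about, rather than on which system raised it, is a choice made for this sketch; it lets several monitoring systems share one layer.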
Cheng Chao believes that once alarm functionality is truly widespread across the computing field, different monitoring systems will look much alike. His team uses WeChat, SMS, email, and DingTalk for notifications. The advantage is that a lot can be done at the notification layer, such as handling alarm storms and other problems that are hard to solve inside the monitoring system. Splitting this layer out creates an opportunity to deliver value outside the monitoring system.

Cheng Chao believes that people in the monitoring field still care too little about the business, because many of those who build monitoring systems used to work in operations or in development. They naturally tend to think that monitoring systems exist only to solve operations problems, and that view is too narrow. In Alibaba's second-generation architecture, the monitoring system was built only to solve operations problems. Last year, however, Alibaba disbanded its entire operations team. Without such a radical change, he said, so-called DevOps is just talk. Once the operations team is disbanded, platform-level, tool-level, automated and intelligent capabilities gradually follow. Without the nanny-like service of an operations team, the tool team and the development team must evolve a set of user models. The team hopes to make this model fine-grained, all-round, full-link, and vertical. The vertical model spans network quality, applications, online indicators, APM, network, DIC, and data. The hope is to use this model to connect and combine them, and this is the future direction of Alibaba's monitoring system.

The speeches from this WOT Summit are compiled and edited by 51CTO. To learn more, visit WWW.51CTO.COM.
[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]