New operations challenges in 2018: three experts explain how to achieve intelligent operations

[51CTO.com original article] As cloud computing spreads across the industry and advanced operations concepts such as DevOps and SRE gain traction, operations innovation has become a key driver in transforming the R&D and operations processes and thinking of major companies. Examples include continuous integration and release, scenario-based operations automation, and intelligent monitoring. At the same time, the role of operations work is quietly changing, from passive response at the end of the pipeline to that of technical product owner, technical operator, and platform builder.

On June 30, despite the scorching afternoon sun, the 21st Tech Neo technical salon hosted by 51CTO, "New Challenges in Operation and Maintenance", drew a full house. Three front-line operations experts joined more than 100 IT professionals to share their practices and thinking on container-based continuous integration and release, intelligent monitoring and fault self-healing, and cost and performance optimization. The talks were dense with practical detail and aimed squarely at operations pain points, the on-site discussion was lively, and the audience laughed from time to time.

Yu Bingzhe, Sina Weibo: Operations practice built on a real-time log collection system

Yu Bingzhe heads the log system at Sina Weibo. He has five years of hands-on experience in log processing and is responsible for maintaining the log system of the Mobile Service Assurance Department in Sina Weibo's Mobile Weibo Product Department.

Yu Bingzhe began by introducing the MAPI log architecture of Sina Mobile Weibo. He then showed, from a practical perspective, how the team uses this architecture to monitor links, analyze server performance from the client's point of view, compute multi-dimensional statistics on client-side video, provide real-time API services on Elasticsearch (ES), and carry out cost accounting.

Yu Bingzhe also described the problems his team ran into and the solution to each: log loss, ES cluster monitoring, uneven ES server quality, the architecture migration from Rsyslog to a Kafka queue, and Kafka monitoring and management. For the uneven ES server quality problem, for example, the team first pre-migrated the shards on each machine according to the regression load of the different machines, and then pre-allocated them by business, so that services with independent resources have their own resource pools while shared users draw on public resources.
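The shard rebalancing idea described above can be illustrated with a minimal sketch. This is not the Weibo team's implementation: the function name `assign_shards`, the pool and node names, and the per-shard weights are all hypothetical, and a simple greedy "least-loaded node first" placement stands in for whatever load model the team actually used. Businesses that have a dedicated pool are placed only on their own nodes, and everything else falls back to a shared pool.

```python
import heapq
from collections import defaultdict


def assign_shards(shards, pools, default_pool="shared"):
    """Greedily place shards on the least-loaded node of their pool.

    shards: list of (shard_id, business, weight); weight is a hypothetical
            load estimate for the shard.
    pools:  dict mapping pool name -> list of node names. A business whose
            name matches a pool gets that dedicated pool; all others use
            default_pool.
    Returns a dict mapping node name -> list of shard ids.
    """
    # One min-heap of (current_load, node) per pool.
    heaps = {name: [(0.0, node) for node in sorted(nodes)]
             for name, nodes in pools.items()}
    for heap in heaps.values():
        heapq.heapify(heap)

    placement = defaultdict(list)
    # Place the heaviest shards first so lighter ones can even out the load.
    for shard_id, business, weight in sorted(shards, key=lambda s: -s[2]):
        pool = business if business in heaps else default_pool
        load, node = heapq.heappop(heaps[pool])
        placement[node].append(shard_id)
        heapq.heappush(heaps[pool], (load + weight, node))
    return dict(placement)


# Hypothetical example: "video" has a dedicated pool, other businesses share.
pools = {"video": ["v1", "v2"], "shared": ["s1", "s2"]}
shards = [
    ("a", "video", 5), ("b", "video", 3), ("c", "video", 2),
    ("d", "search", 4), ("e", "ads", 4),
]
plan = assign_shards(shards, pools)
for node in sorted(plan):
    print(node, plan[node])
```

In a real Elasticsearch cluster, a plan like this would still have to be applied through the cluster's own shard allocation settings; the sketch only covers the planning step.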

Lv Xiaoxu, Qunar: The evolution of Qunar.com's operations platform from 0 to 1

Lv Xiaoxu is the head of real-time systems at Qunar.com and Qunar's director of operations development. He is mainly responsible for building and maintaining Qunar's data flow infrastructure. He previously worked at Yahoo China and Taobao.com, where his main work was data crawling and web page analysis for Etao.com.

Lv Xiaoxu introduced Prism, Qunar.com's real-time data platform. Prism takes data visualization as its starting point and aims to lower the cost of obtaining data and of using data analysis software. The platform offers real-time log monitoring (ELK), a data bus (Kafka), real-time data analysis (Spark Streaming/Storm/Flink), data storage (Elasticsearch as a Service), and an OLAP/experimentation platform (Zeppelin+Spark/Flink).

What stages has the Prism operations platform gone through? Lv Xiaoxu said that when technologies such as Docker, Marathon, and Mesos appeared, the team was as excited as if it had discovered a new continent. With these technologies, the team can scale the system up or down quickly, support new tools quickly, improve hardware utilization, and lower the cost of using data software. He explained in detail how these technologies helped the platform evolve and what problems came up along the way.

At the end of his talk, Lv Xiaoxu summed up what he and his team have achieved as lowering two barriers: the barrier to deploying data software and the barrier to deploying a Mesos environment. Two problems remain: unbalanced load and slow diagnosis of data anomalies. He plans to solve these first, then onboard new software and build a GPU computing platform.

Sun Jie, Sinopec Ruifei: Exploring and practicing intelligent operations in a large enterprise

Sun Jie is an IT veteran with more than a decade of experience in the industry. He focuses on systems, databases, cloud computing, and intelligent operations management, and has worked on data center construction, private cloud architecture planning and operations management, and big data mining. He describes himself as a practitioner and evangelist in the IT industry.

Sun Jie opened by pointing out that traditional operations software is increasingly unable to meet operations needs, with problems such as scattered data, duplicate collection, and wasted resources. He believes operations should keep moving from traditional "equipment-centric maintenance" toward "data-centric operation". "Although operations at most companies are still mainly manual, supplemented by in-house tools and a small amount of automation, I believe intelligent operations will be the mainstream trend in the future."

Sun Jie then described his ideal state of intelligent operations. Whether on or off the cloud, keeping business systems running stably is the most important task. He listed three benefits of deploying an intelligent operations system. First, it significantly improves operations efficiency and raises the capability and value of the operations team. Second, it makes operations far more transparent, giving managers and operations staff more initiative and control. Third, it significantly reduces the frequency of failures, making operations less of a worry.

Sun Jie went on to share practical experience in operations scenarios such as panoramic business service management, log collection with monitoring and alerting, and knowledge-base-driven fault self-healing. Because every problem he discussed came from real work, the talk resonated with many in the audience. Afterwards, many attendees lined up to ask questions, and the on-site exchange was extremely lively.

51CTO began holding Tech Neo technical salons in 2016 to give IT practitioners a high-quality offline venue for learning and exchange. The salon is currently held only in Beijing, once a month. Each session discusses one topic, drawn from technical fields such as artificial intelligence, big data, cloud computing, blockchain, and the Internet of Things.

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
