Baidu Takeaway Zhang Jian: Using software engineering thinking to solve operation and maintenance problems


[51CTO.com original article] The WOTA2017 Global Architecture and Operation Technology Summit will be held in Beijing on April 14, 2017. Zhang Jian, senior operation and maintenance engineer and technical lead in the operation and maintenance department of the Baidu Takeaway R&D Center, will speak in the "DevOps and Continuous Delivery" track. In an interview before the summit, Zhang Jian told reporters: "Over the past year, the six people on the Baidu Takeaway operation and maintenance R&D team (SRE) have built 11 business platforms, 6 general services, and 11+ golang libraries, and have supported the integration of 30+ business services. We started from scratch and kept moving forward through constant challenges."

【Lecturer Profile】


Zhang Jian, Operation and Maintenance Technology Leader, Operation and Maintenance Department, Baidu Takeaway R&D Center

Zhang Jian joined Baidu in 2013 and was responsible for the development and maintenance of Tieba, ksarch, and private PaaS platforms. In 2016 he joined Baidu Waimai, where he leads the operation and maintenance R&D team (now renamed SRE). His responsibilities cover a wide range, including building the Network, SYS, and R&D teams and R&D work for Network, SYS, BP&IT, OP, and other areas. He is also responsible for the two major operation and maintenance platforms: Pacific (a one-stop operation and maintenance platform) and Atlantic (a data management platform).

From manual operation and maintenance to intelligent operation and maintenance

Looking at how operation and maintenance technology has developed, the earliest step was ssh+exp replacing manual logins to servers, with a large share of the work done through batch scripts. The recurring problem at this stage was that complex logic was hard to implement, so scripts alone were not enough. As businesses grew larger and more complicated, the work became harder to keep up with, consumed a great deal of manpower, and was prone to errors. This is how the idea of operation and maintenance tools emerged.

In the tool era, typified by configuration tools such as chef and puppet, operation and maintenance capabilities were turned into tool capabilities one by one. The basic idea was to carry out every complex operation through a software configuration system. Behind a complex operation there might still be scripts running on a single machine, because scripts are the easiest thing for operation and maintenance staff to maintain, and rewriting everything as programs would be very difficult. As demands on IT agility increased, these tool capabilities had to be turned into platforms, and common operation and maintenance scenarios had to be standardized further. Operation and maintenance is now expected to be ever more fine-grained, which calls for broader capabilities, more complete automation, and data analysis. As a result, operation and maintenance has entered the era of intelligence.

It was against this background that DevOps emerged, as both a set of tools and a way of thinking. In recent years, the concept has attracted widespread attention in China and abroad. It enables rapid application deployment, which shortens time to market, reduces the failure rate of new releases, and shortens both the time needed to fix failures and the mean time to recovery. The goal of DevOps is to use automation to maximize the predictability, efficiency, security, and maintainability of the operation and maintenance process.

Zhang Jian said that, in layman's terms, DevOps covers the work where OP (operation and maintenance), QA (testing), and RD (research and development) intersect, such as continuous delivery and online deployment. Many people also equate it with operation and maintenance R&D and see it mainly as a way to solve OP's automation and platform problems. In fact, DevOps is just a term for using software engineering to meet the needs that arise while supporting business development. The difference is that the users being served are not the general public but roles inside the company such as OP, QA, and RD. The goal is to improve the efficiency of every stage, free people from tedious manual operations, and connect the data and operations of every stage across departments. For a company that wants to last, unified and stable basic services are indispensable for improving development productivity.

See how Baidu Waimai builds an operation and maintenance platform through software engineering ideas

How did Baidu Waimai build 11 business platforms, 6 general services, and 11+ golang libraries in less than a year, support the integration of 30+ business services, and provide solid basic support for the company's spinoff and rapid business iteration? Zhang Jian said frankly: "All this stems from our adoption of the SRE methodology. SRE stands for Site Reliability Engineering. It was proposed by Google as a new approach to the operation and maintenance model. SRE means hiring software engineers and using software engineering methods to solve operation and maintenance problems. It is also DevOps thinking put into real practice in operation and maintenance."

He pointed out that DevOps focuses on automating operation and maintenance processes, while SRE focuses more on reliability and systematic thinking. Compared with DevOps, SRE is broader and deeper, because reliability work can be carried out at every layer from top to bottom and never really ends. SRE's methodology is also more complete. To many people, DevOps and SRE are just two terms for the same thing, but their focus is different and their methodological foundations are completely different. One of the highlights of SRE in practice is building a platform-based service system that balances the risk of service unavailability against rapid product innovation while improving operation and maintenance efficiency. For this reason, Baidu Waimai established a dedicated SRE team in 2016 to provide comprehensive technical support and platform development for its spinoff. The SRE team has created two major technical platforms: Pacific, a one-stop service platform for operation and maintenance, and Atlantic, a data display and alarm platform.

Pacific platform navigation interface

Pacific platform network monitoring interface

The platform has to manage many kinds of resources, complex service dependencies, decentralized deployment and execution, and process control with many stages. With this in mind, the team divided the platform components into business logic and general services. It also established a unified technology stack based on Go, chosen for its development efficiency, execution efficiency, and maintainability, so that modules and code can be reused and the same programming approach applies across the platform.

Platform construction phase

During the spinoff period, both the team and the platform were built from scratch. Facing heavy development demand with too few people, the team put several development ideas into practice, such as balancing business needs against development needs so that both sides benefit.

On the one hand, to deliver quickly and make the platform more scalable, the team adopted a microservice architecture. Services access each other in one direction only, with no callbacks, which reduces the complexity of microservice calls. Go binaries with embedded static resources reduce the complexity of microservice deployment. On the other hand, to help the team and the technology stack grow, many open source ideas and modules were introduced. For example, the working mode of the K8s APIServer was adapted to implement the configuration center, and Open-Falcon and Grafana were deeply customized to meet the needs of data alarming and display.
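The article does not describe how the configuration center applies the K8s APIServer working mode. One common reading of that mode is "list, then watch from a version": a client reads a full snapshot together with a version number, then asks only for changes newer than that version. The following is a minimal in-memory sketch of that idea in Go. All names here (`Store`, `Event`, `Set`, `List`, `Watch`) are hypothetical and chosen for illustration; this is not Baidu Waimai's implementation and not the Kubernetes API.

```go
package main

import (
	"fmt"
	"sync"
)

// Event describes one change to a configuration key (hypothetical type).
type Event struct {
	Key     string
	Value   string
	Version uint64 // store-wide version assigned to this change
}

// Store is an in-memory config store that supports list and watch.
type Store struct {
	mu      sync.Mutex
	cond    *sync.Cond
	version uint64            // incremented by one on every write
	data    map[string]string // current value of every key
	history []Event           // history[i] holds the change with Version i+1
}

// NewStore returns an empty store at version 0.
func NewStore() *Store {
	s := &Store{data: make(map[string]string)}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Set writes a key, records the change, wakes up blocked watchers,
// and returns the new store version.
func (s *Store) Set(key, value string) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.version++
	s.data[key] = value
	s.history = append(s.history, Event{Key: key, Value: value, Version: s.version})
	s.cond.Broadcast()
	return s.version
}

// List returns a copy of all keys plus the version the snapshot reflects.
func (s *Store) List() (map[string]string, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	snapshot := make(map[string]string, len(s.data))
	for k, v := range s.data {
		snapshot[k] = v
	}
	return snapshot, s.version
}

// Watch blocks until at least one change newer than `since` exists,
// then returns all such changes in order.
func (s *Store) Watch(since uint64) []Event {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.version <= since {
		s.cond.Wait()
	}
	// Changes after version `since` start at index `since` in history.
	out := make([]Event, len(s.history)-int(since))
	copy(out, s.history[since:])
	return out
}

func main() {
	s := NewStore()
	s.Set("service-a/timeout", "3s")

	// A client first lists the current state and remembers its version.
	snapshot, ver := s.List()
	fmt.Println("keys:", len(snapshot), "version:", ver)

	// It then watches for anything newer than that version.
	done := make(chan []Event)
	go func() { done <- s.Watch(ver) }()

	s.Set("service-a/timeout", "5s")
	for _, ev := range <-done {
		fmt.Printf("%s=%s (version %d)\n", ev.Key, ev.Value, ev.Version)
	}
}
```

Because a watcher names the version it has already seen, it neither misses nor re-reads changes, which also fits the one-way, no-callback style described above: clients pull from the configuration center, and the center never calls back into them. A real service would add persistence, history compaction, and a network API, all of which are left out here.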

In an interview with ***, Zhang Jian said that Baidu Waimai has accumulated rich experience in operation and maintenance, which he will bring to the WOTA2017 Global Architecture and Operation and Maintenance Technology Summit: "I will share how Baidu Waimai completed a large amount of platform development in less than one year to meet the needs of business migration and iteration. I will mainly explain how to extract what is common, which core points to grasp, and how to ensure scalability. For example: how to find the most essential needs among many, and how to prioritize common problems and provide common solutions and services so that work is not duplicated."

World Of Tech focuses on the field of Internet IT technology

Three chapters, 15 technical sessions, and more than 50 top Internet experts from China and abroad come together for an intensive, practical event that covers technical vision, technical practice, and technical foresight!

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
