Ele.me's Cheng Yanling: The right way to enter the era of full-site multi-active operation and maintenance

[51CTO.com original article] On December 1-2, 2017, the WOTD Global Software Development Technology Summit, hosted by 51CTO, will be held at the Shenzhen Zhongzhou Marriott Hotel. The summit's theme is software development, and dozens of expert guests will share technical content. Cheng Yanling, head of OPS at Ele.me, will deliver the keynote "Crossing the Fence - Ele.me Multi-Active Operation and Maintenance Exploration" in the special session on innovative operation and maintenance, covering Ele.me's exploration and practical experience in this area. 51CTO sincerely invites you to attend the conference and share in the joy that technology brings.

The following is the interview transcript:

51CTO reporter: Could you please summarize the main content of this speech?

Cheng Yanling: The speech is mainly about how to cross the invisible "fence" of traditional operation and maintenance and arrive at multi-active operation and maintenance, and how the form of operation and maintenance changes along the way.

The speech covers five aspects: 1. Business characteristics: why Ele.me's business can support its own distinctive form of multi-active; 2. Operation and maintenance planning: what needs to be planned in the design phase before going multi-active; 3. The complexity that multi-active brings to the operation and maintenance system; 4. The changes it brings to the operations system (mainly quality monitoring and efficiency); 5. Automation and intelligence, which still have a long way to go.

51CTO reporter: Could you first introduce the main features of Ele.me's operation and maintenance work? Ele.me's business is growing very rapidly. What are the main pressures this puts on operation and maintenance?

Cheng Yanling: A set of numbers gives a quick picture of the operation and maintenance workload at Ele.me. Ele.me currently has 4 physical IDCs, 2 clouds, about 15,000 physical servers, 1,600 application appIDs, and 1,000 developers, supporting an average of *** orders per day. Over the past year, we delivered an average of 60 servers per day, with an average of 146 releases and 11 rollbacks per day. The longest run on the site-wide stability counter to date is 135 days.

Ele.me's O&M is really O&M plus operations. The O&M part is much the same as elsewhere: it focuses on planning, building, and delivering the underlying infrastructure and on supporting the businesses that run on top of it. We are currently working towards self-service for the product and R&D teams. The operations part is more distinctive. It requires the O&M team to be more sensitive to data such as service quality, CPU utilization, cost allocation, and stability SLAs.

I think the biggest pressure in operation and maintenance comes from keeping up with the pace of business and technology development and improving the efficiency of the product and R&D teams in the shortest possible time. For example: how to build or tear down a service's test environment with one click, how to pull up all the resources a service depends on with one click, and how to quickly rule out each dependent service as the cause after a failure. We need to spend more energy thinking about and improving our tooling, rather than settling for the current state of operation and maintenance.

51CTO reporter: We understand that Ele.me's main-site multi-active switch has been running for half a year. What is the current situation? Why was the main-site multi-active switch necessary? What are its benefits, and what main problems does it solve?

Cheng Yanling: In May this year, the first multi-active switch of the Ele.me main site was successfully completed. Then at the end of June, Ele.me launched the logistics multi-active project. On September 21, the logistics multi-active transformation was successfully completed, and Ele.me entered the multi-active era of the entire site.

Why do multi-active? Because multi-active is a revolutionary technical innovation. Besides solving the problem of a single data center reaching its capacity limit, it also takes on more of the last-resort ("bottom line") disaster recovery work. In particular, it is a "life extension" measure for catastrophic failures that cannot be recovered quickly and for external force majeure affecting key paths, core infrastructure, and core components. In short, supporting business expansion and disaster recovery are the two major benefits of multi-active. It addresses two major problems: what to do when a single data center can no longer be expanded, and how to quickly stop losses and restore service as business and technical complexity grows. The effect is far better than that of traditional disaster recovery.

Before the full-site multi-active drill succeeded, the operation and maintenance team and the entire company had been working heads-down for a long time. As the saying goes, "provisions go before the troops." The basic operation and maintenance team took less than a month to complete the launch, debugging, deployment, and delivery. At the same time, the DBA team and the middleware team planned the database transformation, access, and operation and maintenance solutions, and handled hundreds of support and Q&A requests.

Throughout the process, the operation and maintenance team worked very hard. In many cases the tools were outdated and there were no good reference cases to study. Even so, the business operation and maintenance team helped the product and R&D teams complete the planning, deployment, and debugging of the entire multi-active test environment (simulating dual zones), and took part in discussing and implementing multiple technical transformations and deployment plans.

51CTO reporter: What other operation and maintenance experience would you like to share?

Cheng Yanling: There is a hot topic in operation and maintenance: should we do "firefighting" operation and maintenance or "operations-style" operation and maintenance? The former is probably what most companies practice, and the latter is what most companies aspire to. I think achieving "operations-style" operation and maintenance requires attention to five aspects.

The first is standardization. Standardization is the basis of automation. Most operation and maintenance work is very fragmented. One moment I need to install a machine, the next I need to configure nginx somewhere, and then I need to find out why a log went missing. Over time, efficiency does not improve and the work earns little recognition. The supporting tools also have to make all kinds of adaptations for non-standard requirements, and the resulting tools are not well received ("why doesn't it have this feature!!"). Standardizing the core service processes is the foundation for automation tools.

The second is planning. Plan ahead: which server models to use, a unified operating system, a unified deployment method, whether high availability is dual A or AB, and the various access specifications, usage patterns, and technical solutions. Research and planning should be done in advance, because changes to infrastructure affect the entire system. And if you transform most of it but overlook a small part, you may fall into a pit one day.

The third is efficiency. Operation and maintenance staff need to understand the business and the other side's pain points, and try to resolve a request in one stop while hiding the details the business does not need to care about. Take a commercial view: set SLAs for your service and make it a service that "the other party is willing to buy."

The fourth is data. Any data generated when an application is created (launched) or while it runs is very valuable. Manual maintenance of and changes to this data should be made with caution. If a change must be made, ask whether it points to a gap in the process and whether the change can be optimized. Application asset data can help us work out dependencies, for example by checking whether a connection carries traffic to determine whether the business is still using it. Intelligent operation and maintenance relies even more on this basic data. When automation and intelligence do not work well, the cause is often inaccurate data.

The fifth is balance. The word is vague and seems to have little to do with technology. Indeed, it is not a technical problem. For example, when developing a business or a technology, especially a business that is not yet profitable or a technology that may or may not catch on, how should resources be balanced and scheduled? You can hand the problem to the boss, but it is also something the operation and maintenance team needs to think about. Technical problems are relatively easy to solve. It is often the non-technical problems where decisions are hard.
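As an illustration only (not Ele.me's actual tooling), the idea of using connection traffic to decide whether a dependency is still in use can be sketched in a few lines of Python. The `ConnectionSample` schema, the application names, and the `min_bytes` threshold are all hypothetical assumptions made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionSample:
    """One observation of a connection between two apps (hypothetical schema)."""
    source_app: str
    target_app: str
    bytes_transferred: int


def _traffic_totals(samples):
    """Sum observed bytes per (source_app, target_app) link."""
    totals = defaultdict(int)
    for s in samples:
        totals[(s.source_app, s.target_app)] += s.bytes_transferred
    return totals


def active_dependencies(samples, min_bytes=1):
    """Map each source app to the sorted targets whose links carried traffic."""
    deps = defaultdict(list)
    for (source, target), total in _traffic_totals(samples).items():
        if total >= min_bytes:
            deps[source].append(target)
    return {source: sorted(targets) for source, targets in deps.items()}


def idle_links(samples, min_bytes=1):
    """List links that exist but carried less than min_bytes: candidates for review."""
    return sorted(
        link for link, total in _traffic_totals(samples).items() if total < min_bytes
    )


# Hypothetical sample data; the app names are made up for illustration.
samples = [
    ConnectionSample("order-api", "user-db", 4096),
    ConnectionSample("order-api", "legacy-cache", 0),
    ConnectionSample("order-api", "user-db", 1024),
    ConnectionSample("rider-app", "order-api", 512),
]

print(active_dependencies(samples))
print(idle_links(samples))
```

A link with no traffic in the observation window is only a candidate for cleanup. A short window can miss jobs that run rarely, which is one reason the accuracy of the underlying data matters so much.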

51CTO reporter: Recently there has been a lot of talk in the industry about automated and intelligent operation and maintenance, but there is still no unified operation and maintenance standard. How do you see the future of automated and intelligent operation and maintenance? What do you think intelligent operation and maintenance really means, and what conditions are needed for it to be truly implemented?

Cheng Yanling: Automated and intelligent operation and maintenance are bound to be the trend, but operation and maintenance faces different problems at different stages. Companies also look at it from different angles. Some focus on cost, some on efficiency, and some on the business, and most shift their focus from one stage to the next. There is also no clear "critical point" at present, so it is hard for the industry to form a unified operation and maintenance standard. But there must be a standard that fits your own company's project environment, technical culture, and top-down values. It cannot vary from person to person.

I think what intelligent operation and maintenance really comes down to is data and a shared set of operation and maintenance values. Do not be superstitious about methodology, which is only a code of conduct and a theory. Real implementation still has to start from solving practical problems, so as to better serve users and the business.

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
