How Amazon can achieve continuous delivery

Some time ago I visited a large domestic (Chinese) e-commerce company for a discussion, and I reviewed Amazon's experience as a point of comparison. As far as I know, everything below is publicly available.

Amazon lets a single developer design, develop, test, release, and monitor a feature. On average a release goes out about every ten seconds, which adds up to thousands of releases a day, and delivery stays both fast and high quality.

On the people-management side, this comes mainly from DevOps: everyone owns smaller, more clearly defined tasks, under the motto "You build it, you run it." On the tooling side, the credit goes mainly to a Build Tools group, which has made the internal platform and tools among the best in the industry in both functionality and ease of use. That is what keeps Amazon retail's huge, complex website running smoothly at all times. The group's main tools fall into five categories:

  1. Brazil Build: distributes and builds packages. Each build involves at least hundreds of packages, yet it completes without errors in a few minutes or even tens of seconds.
  2. Apollo Deployment: manages environments, for example which package groups a service needs in order to go online, what its dependencies are, which parameters must be set, and how many machines it needs. Even the smallest service unit is deployed to dozens of hosts each time.
  3. Code base: all code is stored in a central repository, and related code can be searched by reference, method, keyword, and so on.
  4. Monitoring System: monitors services, raises alarms, supports fault analysis, and so on.
  5. Pipeline: connects build, test, and deploy, and monitors the whole process. Most operations, such as rebuilding, rolling back code, and stopping a deployment, take a single click.
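
To make the relationship between these tools more concrete, here is a minimal Python sketch of a pipeline that chains build, test, and deploy and rolls back in one step when a health check fails. It is only an illustration under stated assumptions: the names Environment, Pipeline, release, and rollback are hypothetical and are not the APIs of Brazil, Apollo, or Amazon's Pipeline, whose internals the text above does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Environment:
    """Hypothetical record of what a service needs to go online."""
    service: str
    packages: List[str]            # package groups the service depends on
    parameters: Dict[str, str]     # settings that must be configured
    host_count: int                # number of machines to deploy to
    history: List[str] = field(default_factory=list)  # deployed versions, oldest first

    @property
    def live_version(self) -> Optional[str]:
        """The version currently deployed, or None if nothing is deployed."""
        return self.history[-1] if self.history else None

    def deploy(self, version: str) -> None:
        self.history.append(version)

    def rollback(self) -> Optional[str]:
        """Drop the live version and return whichever version is live afterwards."""
        if not self.history:
            raise RuntimeError("nothing is deployed, so there is nothing to roll back")
        self.history.pop()
        return self.live_version


@dataclass
class Pipeline:
    """Hypothetical pipeline: build, then test, then deploy, then check health."""
    env: Environment
    build: Callable[[str], bool]
    test: Callable[[str], bool]
    healthy: Callable[[Environment], bool]

    def release(self, version: str) -> str:
        if not self.build(version):
            return "build failed"
        if not self.test(version):
            return "tests failed"
        self.env.deploy(version)
        if not self.healthy(self.env):
            # Monitoring reported a problem: undo the deployment in one step.
            self.env.rollback()
            return "rolled back"
        return "released"


# Example run with stand-in build, test, and health-check functions.
env = Environment("cart", ["cart-core", "cart-api"], {"stage": "test"}, host_count=12)
pipeline = Pipeline(
    env,
    build=lambda version: True,
    test=lambda version: True,
    healthy=lambda e: e.live_version != "1.1",  # pretend version 1.1 is unhealthy
)
print(pipeline.release("1.0"))  # released
print(pipeline.release("1.1"))  # rolled back
print(env.live_version)         # 1.0
```

The point of the sketch is that a rollback becomes a single call once the deployment history and the environment description live in one place, which is the property the article credits to Amazon's tooling.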

Compare this with Microsoft. At Amazon, every tool is used uniformly across the whole company and updated promptly and consistently, and a very large group is responsible for developing and maintaining the tools. At Microsoft, because of the organizational structure, code is not visible across groups, and each group builds its own tools independently. The scattered effort and the opacity of code and APIs have left the online infrastructure in very poor shape. As a result, a single rollback at Microsoft means pulling in PM, QA, Dev, and others for a major operation, whereas any developer at Amazon can roll back with one click through Apollo. This also makes daily deployment almost impossible at Microsoft, let alone hourly deployment.

Facebook has been around for more than ten years, but its Ops experience is still relatively thin. I sometimes see the tools my friends at Facebook use at work, and my overall impression is that they lack unified planning. Deployment tools and monitoring exist, but they are not polished. Fortunately, the engineers are strong enough that the company can rely on their individual ability to solve some problems. I will come back to this point below.

Given Google's talent and technical strength, it naturally has no shortage of internal tools. They are just somewhat less easy to use than Amazon's, which of course makes little difference to Google's engineers.

As I just mentioned, Facebook's tools are not yet polished and the company often falls back on the quality of its engineers, and the usability of Google's tools could also be better. So why does Amazon make its internal tools so powerful and so product-like? Most companies reason that internal tools are only for internal use anyway, so it does not matter whether they are polished, ugly, or unified.

The answer lies in Amazon's talent strategy. More than 90% of Amazon's employees are junior programmers, hired from campus or with one to two years of work experience. To get real value out of these people, there are two options:

1. Spend a year training them until they can handle the business independently.

2. Break the work down into tasks that are small and simple enough, and give them tools powerful enough that they can deliver their full value within one to two months.

Amazon obviously chose the second option. Consider that new hires go on call in their second week; without powerful tools, they could not possibly resolve system problems. The second option is clearly very unfriendly to engineers, but from the perspective of capital it is a cheap way to extract programmers' labor. It also gives Amazon a higher turnover rate than Google and Facebook.

As an engineer, I really liked the way Google treats its engineers. After more exposure to the business world, though, I came to feel that companies like Amazon and Uber, and even Facebook, operate more like normal businesses, while Google's overly idealistic approach makes it more like a research institute.

So when should a company start paying attention to internal tools? According to an analysis by a Twitter engineer (I cannot locate the article at the moment), internal tools start to improve the efficiency and engineering quality of the whole team once the engineering team exceeds 50 people. The companies compared above all have very strong engineers. If your company's engineers are not that strong, good tooling is even more important for achieving continuous release.

Most companies in the US strongly support building internal tools and are willing to invest in them. In China, companies pay too little attention to them, either because they underestimate the value of tools or because they lack long-term planning. I have heard that every time Didi releases a new version, the CEO has to go on stage to rally everyone, which is very stressful, and that engineers work overtime every day after the release. This example shows the difference.
