This article is reprinted from the WeChat public account "Programmer jinjunzhu", author jinjunzhu. Please contact the Programmer jinjunzhu public account to reprint this article.

At 4 a.m., I was woken up by the company's monitoring alarm. A batch task in the production environment had failed. I got up immediately to deal with it, but it still took a long time to resolve.

The failed task was a data verification task, which checks whether the data produced by the previous batch task is correct. Fortunately, the preceding core task had already completed, so the production trading system was not affected.

Why do I mention the trading system here? Because the trading system is the entry point for the entire system's business traffic. If it fails, the company loses revenue directly.

Today's topic is service governance. The ultimate goal of service governance is a system that provides uninterrupted service 24 hours a day, 7 days a week.

1 Monitoring Alarm

The company's production alarm was very accurate this time. It reached the direct maintainer of the system and identified which batch task had failed. The alarm was triggered by monitoring the task execution results of the batch task middleware.

In general, what types of alarms are there? Let's look at the following figure:

1.1 Batch Processing Efficiency

In most cases, batch processing tasks do not block the business entry point, so they do not need to be monitored. When they do block the business entry point, they must be monitored. Let me give you two business scenarios:
In these scenarios, batch processing efficiency is a very important monitoring indicator, and timeout thresholds must be configured and monitored.

1.2 Traffic Monitoring

Commonly used rate-limiting indicators are as follows:

We need to pay attention to the following points when monitoring traffic:
1.3 Exception Monitoring

Exception monitoring is very important for the system. It is difficult to guarantee that a program never throws exceptions in the production environment, so properly configured exception alarms are crucial for locating and solving problems quickly. In the batch alarm mentioned at the beginning, for example, the alarm message contained the exception, which allowed me to locate the problem quickly. The following aspects should be noted for exception monitoring:
1.4 Resource Utilization

When configuring system resources in a production environment, you generally need a forecast of resource utilization. For example, how long will it take for Redis to run out of memory at the current memory growth rate, and how long will it take for the database to run out of disk at the current growth rate?

System resources need a threshold, such as 70%, and an alarm is triggered when usage exceeds it. This is because processing efficiency drops severely when resource usage approaches saturation.

When configuring resource usage thresholds, be sure to account for sudden increases in traffic and business bursts, and reserve additional resources in advance to cope with them.

For core services, rate-limiting measures must be in place to prevent sudden increases in traffic from overwhelming the system.

1.5 Request Latency

Request latency is not an easy metric to measure. The following figure is an e-commerce shopping system:

In this diagram, we assume that the composite service concurrently calls the order, inventory, and account services below it. After the client sends a request, the composite service takes 2 seconds to process it and the account service takes 3 seconds, so the read timeout configured by the client must be at least 5 seconds.

The monitoring system needs a threshold for this metric. For example, if 100 requests within 1 second have a latency greater than 5 seconds, an alarm is triggered so that system maintenance personnel can find the problem.

The read timeout set by the client must not be too large. If the delay is caused by a server failure, requests must fail fast, so that resources held by stalled requests are released and do not degrade system performance.
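
The two rules in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular monitoring system implements them: the names `min_read_timeout` and `SlowRequestAlarm` are hypothetical, the timeout formula assumes the composite service's own 2 seconds are added to the slowest concurrent downstream call, and the alarm assumes requests are recorded in non-decreasing time order.

```python
from collections import deque


def min_read_timeout(composite_s, downstream_latencies_s):
    """Smallest client read timeout that still lets a normal request finish.

    Assumption: downstream services are called concurrently, so only the
    slowest one adds to the composite service's own processing time.
    """
    return composite_s + max(downstream_latencies_s)


class SlowRequestAlarm:
    """Fire when too many slow requests are seen within a sliding window."""

    def __init__(self, latency_threshold_s=5.0, window_s=1.0, max_slow_requests=100):
        self.latency_threshold_s = latency_threshold_s
        self.window_s = window_s
        self.max_slow_requests = max_slow_requests
        self._slow_timestamps = deque()  # completion times of slow requests

    def record(self, finished_at_s, latency_s):
        """Record one finished request; return True if the alarm should fire."""
        if latency_s > self.latency_threshold_s:
            self._slow_timestamps.append(finished_at_s)
        # Drop slow requests that have fallen out of the window.
        while (self._slow_timestamps
               and finished_at_s - self._slow_timestamps[0] >= self.window_s):
            self._slow_timestamps.popleft()
        return len(self._slow_timestamps) >= self.max_slow_requests


# Composite service: 2 s of its own work, account service: 3 s.
print(min_read_timeout(2.0, [3.0]))
```

In practice the counting would usually be done by the metrics backend rather than in application code; the sketch only makes the threshold rule concrete.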
1.6 Monitoring Considerations

Monitoring exists so that system maintenance personnel can quickly discover production problems and locate their causes, but the monitoring system itself also has several indicators that need to be considered:
To avoid the long tail effect, it is best not to use the average value. As shown below, among 10 requests, 9 have a latency of 1 second but 1 has a latency of 10 seconds, so the average (1.9 seconds) is not a meaningful reference.

Instead, you can group requests by latency interval, for example the number of requests with a latency of less than 1 second, of 1-2 seconds, and of 2-3 seconds, and configure the monitoring thresholds to grow exponentially.

2 Fault Management

2.1 Common Causes of Failure

There are many possible causes of failure, but the most common ones are as follows:
2.2 Coping Strategies

To deal with failures, we proceed in two steps:
2.2.1 Software Upgrade Failure

Some faults caused by an upgrade are exposed soon after going online, while others surface only after a long time, for example because some business code paths had never been executed before.

For the first case, grayscale release can be used for verification.

The second case is difficult to avoid completely; we can only maximize test case coverage.

2.2.2 Hardware Resource Failure

These faults can be divided into two main categories:
For the first type of failure, we generally rely on monitoring alarms to notify the responsible person. The main remedies are to add resources and to find and optimize the programs that consume excessive resources.

For the second type of failure, operation and maintenance personnel need to keep records of hardware resources, monitor them, and replace aging hardware in a timely manner.

2.3 System Overload

System overload may be caused by a sudden increase in traffic, such as a flash sale, or traffic may gradually exceed the system's capacity as the business grows. You can deal with it by adding resources or by rate limiting.

2.4 Malicious Attacks

There are many types of malicious attacks, such as DDoS attacks, malware, and browser attacks.

There are also many ways to defend against them, such as encrypting request messages, introducing professional network security firewalls, running regular security scans, and deploying core services on non-default ports.

2.5 Basic Software Failure

As shown in the following figure, every component other than the business services is basic software, and each one needs to be designed for high availability.

3 Release Management

Release usually refers to the upgrade of software and hardware, including business system version upgrades, basic software upgrades, and hardware environment upgrades. Since I am a programmer, this article discusses only upgrades of the business system.

3.1 Release Process

In general, the business system upgrade process is as follows:

Release to the production environment. If verification finds no problems, the release is successful.

3.2 Release Quality

When upgrading software, release quality is very important. To ensure release quality, you need to pay attention to the following issues.

3.2.1 CheckList

To ensure the quality of the release, a CheckList is maintained before the release, and the development team confirms every item on it.
After the checklist is confirmed, the build is released. The following are some typical issues:
3.2.2 Grayscale Release

Grayscale release refers to a release method that can transition smoothly between black and white. As shown in the following figure:

The upgrade uses the canary deployment method. First, one of the servers is upgraded as the canary. After this server has run in the production environment without any problems, the other servers are upgraded. If there are any problems, the release is rolled back.

3.2.3 Blue-Green Deployment

The blue-green deployment method is as follows:

Before the upgrade, client requests are sent to the green system. After the upgrade is released, the load balancer switches requests to the blue system. The green system is not taken offline immediately. If testing in production finds no problems, the green system is taken offline; otherwise traffic is switched back to the green system.

The difference between the two is that canary deployment does not require new machines, while blue-green deployment is equivalent to adding a whole new set of machines, which incurs additional resource costs.

3.2.4 A/B Testing

A/B testing refers to releasing multiple versions in the production environment at the same time. Its main purpose is to compare the effects of the different versions, for example different page styles or different operation flows, and let users' preferences decide which version becomes the final one. As shown below:

Services of three colors are deployed, and client requests are sent to services of the same color.

Unlike grayscale release, the versions used in A/B testing have already been verified to have no problems.

3.2.5 Configuration Changes

We often keep configuration alongside the code, for example in yaml files. As a result, we need to release a new version every time the configuration is modified.
If the configuration is modified frequently, you can consider the following two methods:

Introducing a configuration center.
Using an external system to store the configuration.

4 Capacity Management

Section 2.3 discussed system failures caused by system overload. Capacity management is an important part of keeping the system stable after it goes online. Its main purpose is to ensure that system traffic does not exceed the threshold the system can withstand, so that the system does not crash.

In general, the reasons for system capacity overload are as follows:

Continuous business growth brings increasing traffic to the system.
System resources shrink, for example when a new application is deployed on the same machine and occupies some of its resources.
The system processes requests more slowly. For example, as the amount of data grows, the database responds more slowly, so a single request takes longer to process and resources cannot be released.
Requests increase because of retries.
Traffic increases suddenly, such as when the Weibo system encounters news about a celebrity divorce.

4.1 Retry

Retrying some failed requests can greatly improve the user experience of the system. Retries generally fall into two categories: retrying requests whose connection timed out, and retrying requests whose response timed out.

A connection timeout may be caused by a transient network failure. In this case, retrying does not put pressure on the server, because the failed request never reached the server.

Retrying a request whose response timed out, however, may put additional pressure on the server. As shown in the following figure:

Under normal circumstances, the client calls service A, service A then calls service B, and service B is called only once.

Now suppose service B responds slowly and times out, and both the client and service A are configured to retry twice on failure.
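
The effect of stacking retries along a call chain can be worked out with a short calculation. The following is a minimal sketch, assuming every caller uses up all of its retries because the last service never responds; the function name `max_downstream_calls` is hypothetical and introduced only for illustration.

```python
def max_downstream_calls(retries_per_hop):
    """Worst-case number of calls that reach the last service in a chain.

    retries_per_hop[i] is the number of retries (on top of the first
    attempt) configured on hop i, ordered from the client downward.
    """
    total = 1
    for retries in retries_per_hop:
        total *= 1 + retries  # each caller makes 1 attempt plus its retries
    return total


# Client -> service A with 2 retries, service A -> service B with 2 retries.
print(max_downstream_calls([2, 2]))  # prints 9
```

The growth is multiplicative: with the same setting of 2 retries on each of four hops, the worst case at the end of the chain is 81 calls.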
If service B never responds, it is ultimately called 9 times: the client makes up to 3 attempts against service A, and each of those attempts makes up to 3 attempts against service B.

In a large distributed system, if the call chain is very long and every service is configured with retries, the retries put huge pressure on the downstream services in the call chain and can even bring the system down. More retries are therefore not necessarily better. Reasonable retry settings protect the system.

For retrying, there are three suggestions:

Non-core services should not retry. If they do retry, the number of retries must be limited.
The retry interval should increase exponentially.
Retry based on the returned failure status. For example, if the server defines a rejection code, the client does not retry when it receives that code.

4.2 Sudden Traffic Increase

It is difficult to plan ahead for sudden increases in traffic.

When traffic increases suddenly, we can first consider adding resources. Taking K8S as an example, if there are originally 2 pods, scale the Deployment to 4 pods. The command is as follows:
If the resources have been used up, you have to consider rate limiting. Here are some recommended rate-limiting frameworks:
4.3 Capacity Planning

It is very important to do capacity planning well in the early stages of system construction. You can estimate the system's QPS from the business volume and run stress tests based on that QPS.

The capacity estimated from stress test results may not be enough to cope with real scenarios and emergencies in the production environment. You can therefore reserve resources on top of the estimated capacity, for example by doubling it.

4.4 Service Degradation

There are three ways for the server to degrade its service:
5 Conclusion

Microservice architecture brings many benefits to a system, but it also brings technical challenges. These challenges include service registration and discovery, load balancing, monitoring management, release upgrades, and access control. The purpose of service governance is to manage and prevent these problems so that the system keeps running smoothly.

The service governance approach described in this article is a traditional one. It sometimes requires intrusive code changes, and the choice of framework can also restrict the programming language.

In the cloud-native era, the emergence of Service Mesh has brought the topic of service governance into a new stage. I will share more about this later.