What does service governance govern? 10 pictures tell you the answer

This article is reprinted from the WeChat public account "Programmer jinjunzhu", by author jinjunzhu. Please contact the "Programmer jinjunzhu" public account for permission to reprint.

At 4 a.m., I was woken up by the company's monitoring alarm: a batch task in the production environment had failed. I got up immediately to deal with it, but it still took a long time to resolve.

The failed job was a data verification task that checks whether the data produced by the previous batch task is correct. Fortunately, the preceding core task had already completed, so the production trading system was not affected.

Why mention the trading system here? Because the trading system is the entry point for all of the system's business traffic; if it fails, the company loses revenue directly.

The topic we are talking about today is service governance. The ultimate goal of service governance is a system that provides uninterrupted service 7 * 24 hours a day.

1 Monitoring Alarm

The company's production alarm was very accurate this time: it reached the system's direct maintainer and identified which batch task had failed. The alarm was triggered by monitoring the task execution results in the batch middleware.

In general, what types of alarms are there? Let's look at the following figure:

1.1 Batch Processing Efficiency

In most cases, batch tasks do not block the business entry point, so their efficiency does not need to be monitored.

When they do block the business entry point, batch tasks must be monitored. Here are two business scenarios:

  • The domain name system needs to compare DNS records against database records to find dirty data and compensate those transactions. During this window, the domain information a customer queries may be dirty data.
  • Real-time transactions are not allowed during a bank's end-of-day batch processing window, which conflicts with the goal of uninterrupted 7 * 24 service.

In these scenarios, batch processing efficiency is a very important monitoring indicator, and timeout thresholds must be configured and monitored, as sketched below.
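
To make that concrete, here is a minimal sketch (mine, not the article's) of a batch-duration check; the 30-minute threshold and the sendAlarm hook are hypothetical:

    import java.time.Duration;
    import java.time.Instant;

    public class BatchTimeoutCheck {
        // Hypothetical threshold: alarm if the batch runs longer than 30 minutes
        private static final Duration TIMEOUT = Duration.ofMinutes(30);

        public static void runWithTimeoutCheck(Runnable batchTask) {
            Instant start = Instant.now();
            batchTask.run();
            Duration elapsed = Duration.between(start, Instant.now());
            if (elapsed.compareTo(TIMEOUT) > 0) {
                sendAlarm("Batch exceeded timeout, ran " + elapsed.toMinutes() + " minutes");
            }
        }

        private static void sendAlarm(String message) {
            // Placeholder: wire this to the real alerting channel
            System.err.println("[ALARM] " + message);
        }
    }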

1.2 Traffic Monitoring

Commonly used rate-limiting metrics are as follows:

We need to pay attention to the following points when monitoring traffic:

  • Different systems use different metrics. For Redis, for example, QPS is appropriate; for a trading system, TPS (a simple QPS counter is sketched after this list).
  • Configure appropriate monitoring thresholds through testing and business-volume estimation.
  • Monitoring thresholds need to account for exceptional situations, such as flash sales and coupon-grabbing events.
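
As a minimal illustration of measuring QPS against a threshold (my sketch; the fixed 1-second window is a simplification of what a real monitoring platform does):

    // Counts requests in fixed 1-second windows; the caller compares the
    // returned QPS against the configured monitoring threshold.
    public class QpsCounter {
        private long windowStart = System.currentTimeMillis();
        private long count = 0;

        /** Records one request; returns the QPS of a completed window, or -1 otherwise. */
        public synchronized long record() {
            long now = System.currentTimeMillis();
            if (now - windowStart >= 1000) {
                long qps = count;
                count = 1;          // the current request opens the new window
                windowStart = now;
                return qps;
            }
            count++;
            return -1;
        }
    }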

1.3 Exception Monitoring

Exception monitoring is very important to the system. It is hard to guarantee that a program never throws exceptions in production, so properly configured exception alarms are crucial for locating and solving problems quickly. The batch alarm mentioned at the beginning, for example, carried the exception itself, which let me locate the problem quickly.

The following aspects of exception monitoring deserve attention:

  • When a client read times out, investigate the cause on the server side as soon as possible.
  • Set a threshold for how long the client waits for a response, such as 1 second, and trigger an alarm when it is exceeded.
  • Business-level failures must also be monitored, such as failure response codes (a minimal counting sketch follows this list).
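
For the last point, a rough sketch (mine, not the article's) that counts failure response codes and alarms at a hypothetical threshold:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class FailureCodeMonitor {
        // Hypothetical threshold: alarm when one failure code has occurred 100 times
        private static final long THRESHOLD = 100;
        private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

        public void onResponse(String responseCode) {
            if (!responseCode.startsWith("2")) { // treat non-2xx codes as failures
                long n = counters.computeIfAbsent(responseCode, c -> new AtomicLong())
                                 .incrementAndGet();
                if (n == THRESHOLD) {
                    System.err.println("[ALARM] failure code " + responseCode
                            + " reached " + n + " occurrences");
                }
            }
        }
    }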

1.4 Resource Utilization

When configuring system resources for a production environment, you generally need a prediction of resource utilization: at the current memory growth rate, how long until Redis runs out of memory? At the current data growth rate, how long until the database disk fills up?
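
The prediction itself is simple arithmetic. A back-of-the-envelope sketch (all numbers are made up for illustration):

    public class ExhaustionEstimate {
        public static void main(String[] args) {
            double totalGb = 64;          // total Redis memory (hypothetical)
            double usedGb = 40;           // currently used
            double growthGbPerDay = 0.5;  // observed daily growth (hypothetical)

            // Remaining headroom divided by growth rate gives days to exhaustion:
            // (64 - 40) / 0.5 = 48 days
            double daysLeft = (totalGb - usedGb) / growthGbPerDay;
            System.out.printf("Memory exhausted in about %.0f days%n", daysLeft);
        }
    }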

System resources need a usage threshold, such as 70%, with an alarm triggered when it is exceeded, because processing efficiency drops sharply as resource usage approaches saturation.

When configuring resource usage thresholds, be sure to consider sudden increases in traffic and business bursts and reserve additional resources in advance to cope with them.

For core services, rate-limiting measures must be implemented to prevent sudden traffic increases from overwhelming the system.

1.5 Request Delay

Request latency is not an easy metric to measure. The following figure shows an e-commerce shopping system:

In this diagram, we assume the composite service concurrently calls the order, inventory, and account services. After the client sends a request, the composite service takes 2 seconds of its own processing time and the account service takes 3 seconds, so the minimum read timeout the client can configure is 5 seconds.

The monitoring system needs an alarm threshold, for example: if more than 100 requests within 1 second have a latency above 5 seconds, an alarm is triggered so that maintainers can look for the problem.

The client's read timeout also cannot be too large. If the delay is caused by a server failure, the client must fail fast; otherwise resources cannot be released, and system performance and service quality degrade.
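
As a minimal sketch of a fail-fast client using the JDK's built-in HttpClient (the URL and the 2- and 5-second values are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class FailFastClient {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))   // fail fast on connect
                    .build();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://example.com/api/order")) // placeholder URL
                    .timeout(Duration.ofSeconds(5))          // bounded read timeout
                    .build();
            // Throws HttpTimeoutException instead of holding resources indefinitely
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }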

1.6 Monitoring Considerations

Monitoring exists so that maintainers can quickly discover production problems and locate their causes, but several aspects of the monitoring system itself need consideration:

  • Choose the sampling frequency of each metric according to the monitoring goal; too high a frequency increases monitoring cost.
  • Monitoring coverage should ideally include all core system metrics.
  • Monitoring effectiveness: more metrics are not always better. Too many bring extra work in judging which alarms matter and desensitize developers to them.
  • Alarm timeliness: for non-real-time systems such as batch tasks, real-time alarms are unnecessary. The event can be recorded and the alarm scheduled, for example at 8 a.m., when the responsible person arrives at the company to handle it.

To avoid being misled by the long tail, it is best not to use averages. As shown below:

Among 10 image requests, 9 have a latency of 1 second but 1 has a latency of 10 seconds. The average, (9 × 1 + 10) / 10 = 1.9 seconds, hides the outlier, so it has little reference value.

Instead, you can group requests into latency buckets, for example the number of requests with latency under 1 second, between 1 and 2 seconds, between 2 and 4 seconds, and so on, with bucket boundaries growing exponentially, and configure a monitoring threshold per bucket.
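
A minimal bucketing sketch (mine; real systems would use their metrics library's histogram support):

    // Latency histogram with exponentially growing bucket boundaries.
    public class LatencyHistogram {
        // Bucket upper bounds in milliseconds: 1s, 2s, 4s, 8s (exponential)
        private final long[] bounds = {1000, 2000, 4000, 8000};
        private final long[] counts = new long[bounds.length + 1]; // +1 overflow bucket

        public synchronized void record(long latencyMillis) {
            for (int i = 0; i < bounds.length; i++) {
                if (latencyMillis <= bounds[i]) {
                    counts[i]++;
                    return;
                }
            }
            counts[bounds.length]++; // long-tail requests land here
        }

        public synchronized long longTailCount() {
            return counts[bounds.length]; // alarm when this bucket grows abnormally
        }
    }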

2 Fault Management

2.1 Common Causes of Failure

There are many possible causes of failure; the most common are as follows:

  • Release upgrade failures
  • Hardware resource failure
  • System overload
  • Malicious Attacks
  • Basic service failure

2.2 Coping Strategies

To deal with failures, we proceed in two steps:

  • Resolve the fault immediately. For example, if the fault was caused by bad data, fix the problematic data first.
  • Find the cause of the failure. You can locate and solve the problem by examining logs or the call-chain tracing system.

2.2.1 Software Upgrade Failure

Some faults caused by an upgrade are exposed soon after going live, while others surface only after a long time, for example in business code paths that had never been executed before.

For the first case, grayscale release can be used to verify the new version on a small scale first.

For the second case, complete avoidance is difficult; we can only maximize test-case coverage.

2.2.2 Hardware Resource Failure

These faults can be divided into two main categories:

  • Hardware resource overload, such as insufficient memory
  • Hardware resource aging

For the first type, monitoring alarms generally notify the responsible person. The handling is mainly to add resources, and to find and optimize the programs that consume resources heavily.

For the second type of failure, operation and maintenance personnel are required to record and monitor hardware resources and replace aging resources in a timely manner.

2.3 System Overload

System overload may be caused by a sudden traffic increase such as a flash sale, or the load may gradually exceed the system's capacity as the business grows. It can be handled by adding resources or by rate limiting.

2.4 Malicious Attacks

There are many types of malicious attacks, such as DDoS attacks, malware, and browser-based attacks.

There are many ways to prevent malicious attacks, such as encrypting request messages, introducing professional network security firewalls, regular security scans, and deploying core services on non-default ports.

2.5 Basic Software Failure

As shown in the following figure, every component other than the business services is basic software, and each needs to be designed for high availability.

3 Release Management

Release usually refers to upgrading software or hardware, including business system version upgrades, basic software upgrades, and hardware environment upgrades. As programmers, the upgrades discussed in this article are business system upgrades.

3.1 Release Process

In general, the business system upgrade process is as follows:

The build is released to the production environment; if verification finds no problems, the release is successful.

3.2 Release Quality

When upgrading software, release quality is very important. To ensure release quality, you need to pay attention to the following issues.

3.2.1 CheckList

To ensure release quality, a CheckList is maintained before the release and the development team confirms every item on it; only then is the build released. Typical items include:

  • Is the online SQL correct?
  • Are the configuration items of the production configuration file complete?
  • Have the externally dependent services been published and verified?
  • Has the new machine routing permission been enabled?
  • Is the release order of multiple services clear?
  • What is the plan if a failure occurs after going online?

3.2.2 Grayscale Release

Grayscale release is a release method that transitions smoothly from "black" (the old version) to "white" (the new version). As shown in the following figure, when upgrading the application image, a canary deployment is adopted: one server is released and upgraded first as the canary. After that server runs in production without problems, the other servers are upgraded; if problems appear, it is rolled back.

3.2.3 Blue-Green Deployment

The blue-green deployment method is as follows:

Before the upgrade, client requests are sent to the green service. After the new version is released, load balancing transfers requests to the blue system while the green system temporarily stays online. If production verification finds no problems, the green system is taken offline; otherwise traffic is switched back to it.

The difference between blue-green deployment and canary deployment is that canary deployment requires no new machines, while blue-green deployment effectively adds a whole new set of machines and therefore extra resource cost.

3.2.4 AB Testing

AB testing means releasing multiple versions in the production environment at once, mainly to compare the effects of different versions, such as different page styles or operation flows, letting users' choices determine the final version. As shown below:

Services of three colors are deployed, and each client request is routed to the service of its assigned color.

Unlike grayscale release, the versions in an AB test have all been verified to work correctly; the purpose is comparison, not defect detection.

3.2.5 Configuration Changes

We often write configuration into files packaged with the code, such as yaml files, which means every configuration change requires releasing a new version. If configuration changes frequently, consider the following two approaches (a sketch of the second follows):

  • Introducing a configuration center
  • Using an external system to save configuration
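
A minimal sketch of the second approach (the class name, file path, and 30-second refresh interval are hypothetical; real projects would more likely adopt a configuration center such as Apollo, Nacos, or Spring Cloud Config):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Periodically reloads configuration from a file outside the deployed artifact,
    // so changes take effect without releasing a new version.
    public class ExternalConfig {
        private volatile Properties props = new Properties();

        public ExternalConfig(String path) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> reload(path), 0, 30, TimeUnit.SECONDS);
        }

        private void reload(String path) {
            Properties fresh = new Properties();
            try (FileInputStream in = new FileInputStream(path)) {
                fresh.load(in);
                props = fresh; // atomic swap of the whole snapshot
            } catch (IOException e) {
                // keep serving the last good configuration on read failure
            }
        }

        public String get(String key, String defaultValue) {
            return props.getProperty(key, defaultValue);
        }
    }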

4 Capacity Management

Section 2.3 discussed failures caused by system overload. Capacity management is an important part of keeping a system stable after it goes live: it ensures that traffic does not exceed the threshold the system can withstand and prevents crashes. In general, capacity overload has the following causes:

  • Continuous business growth brings ever more traffic to the system
  • System resources shrink, for example when a new application deployed on the same machine occupies some resources
  • The system processes requests more slowly; for example, growing data volume makes the database respond more slowly, so a single request takes longer and resources are not released in time
  • Retries increase the number of requests
  • A sudden traffic spike, such as when the Weibo system encounters news of a celebrity divorce

4.1 Retry

Retrying failed requests can greatly improve the user experience. Retries generally fall into two categories: retries after a connection timeout and retries after a response timeout.

A connection timeout may be caused by a transient network failure. Retrying in this case puts no pressure on the server, because the failed request never reached it.

Retrying a request whose response timed out, however, may put extra pressure on the server. As shown in the figure below, under normal circumstances the client calls service A, service A calls service B, and service B is called exactly once.

If service B responds slowly and times out, and both the client and service A are configured to retry twice on failure, then when service B keeps failing it ends up being called 3 × 3 = 9 times: three attempts from the client, each triggering three attempts from service A.

In a large distributed system with a long call chain, if every service is configured with retries, those retries put enormous pressure on the downstream services in the chain and can even crash the system. Clearly, more retries are not better; only reasonable retry settings protect the system.

For retrying, there are three suggestions (a backoff sketch follows this list):

  • Non-core services should not retry; if they do, the number of retries must be limited
  • The retry interval should increase exponentially
  • Retry based on the returned failure status; for example, if the server returns a defined rejection code, the client should not retry
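
A minimal sketch combining the first two suggestions, a capped retry count with exponentially growing intervals (the 100 ms initial delay is arbitrary):

    import java.util.concurrent.Callable;

    public class RetryWithBackoff {
        public static <T> T call(Callable<T> task, int maxRetries) throws Exception {
            long delayMillis = 100; // hypothetical initial backoff
            for (int attempt = 0; ; attempt++) {
                try {
                    return task.call();
                } catch (Exception e) {
                    if (attempt >= maxRetries) {
                        throw e; // give up instead of piling pressure on downstream
                    }
                    Thread.sleep(delayMillis);
                    delayMillis *= 2; // exponential growth: 100ms, 200ms, 400ms ...
                }
            }
        }
    }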

4.2 Sudden Traffic Increase

It is difficult to plan ahead for sudden increases in traffic.

When encountering a sudden traffic increase, first consider adding resources. Taking K8S as an example, if a Deployment originally runs 2 pods, it can be scaled out to 4 with the following command:

    kubectl scale deployment springboot-deployment --replicas=4

If resources are already exhausted, you have to consider rate limiting. Here are some recommended rate-limiting frameworks (a Guava example follows the list):

  • Google Guava (RateLimiter)
  • netflix/concurrency-limits
  • Sentinel
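
A minimal sketch using Guava's RateLimiter (the 100-permits-per-second figure is arbitrary):

    import com.google.common.util.concurrent.RateLimiter;

    public class RateLimitDemo {
        // Allow roughly 100 requests per second (hypothetical threshold)
        private static final RateLimiter LIMITER = RateLimiter.create(100.0);

        public static String handleRequest() {
            if (!LIMITER.tryAcquire()) {
                return "rejected"; // shed load instead of letting the system crash
            }
            return "processed";
        }
    }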

4.3 Capacity Planning

It is very important to do a good job of capacity planning in the early stages of system construction.

You can estimate the system's QPS from the business volume and stress-test against that QPS. Capacity estimated from stress-test results may still fall short of real production scenarios and emergencies, so reserve resources beyond the estimate, for example by doubling the capacity.
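
A made-up worked example (the business numbers are mine, purely for illustration):

    public class CapacityEstimate {
        public static void main(String[] args) {
            long dailyRequests = 10_000_000L; // hypothetical daily business volume
            double peakShare = 0.8;           // assume 80% of traffic in the peak window
            double peakHours = 4;

            // 10,000,000 * 0.8 / (4 * 3600) ≈ 556 QPS at peak
            double peakQps = dailyRequests * peakShare / (peakHours * 3600);
            System.out.printf("Estimated peak QPS: %.0f%n", peakQps);
            // Doubling the estimate leaves headroom for emergencies
            System.out.printf("Planned capacity: %.0f QPS%n", peakQps * 2);
        }
    }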

4.4 Service Degradation

There are three common ways to degrade service when the system is under pressure:

  • When server capacity is overloaded, new requests are rejected directly
  • Non-core services are suspended to reserve resources for core services
  • The client performs its own degradation based on the proportion of requests the server rejects. For example, if the server rejects 100 of 1,000 requests in one minute, the client can use that as a reference and directly reject outgoing requests beyond roughly 900 per minute, the volume the server actually accepted (a client-side sketch follows this list)
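
For the third point, a client-side sketch (mine; the 900-per-minute cap is the hypothetical figure derived above):

    // Caps outgoing requests at the volume the server recently accepted.
    public class ClientThrottle {
        private long windowStart = System.currentTimeMillis();
        private int sentInWindow = 0;
        private int allowedPerMinute = 900; // derived from the server's accept rate

        public synchronized boolean tryAcquire() {
            long now = System.currentTimeMillis();
            if (now - windowStart >= 60_000) {
                windowStart = now;
                sentInWindow = 0;
            }
            if (sentInWindow >= allowedPerMinute) {
                return false; // degrade locally, sparing the overloaded server
            }
            sentInWindow++;
            return true;
        }

        /** Update the cap from observed server behavior. */
        public synchronized void updateAllowed(int acceptedLastMinute) {
            allowedPerMinute = acceptedLastMinute;
        }
    }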

5 Conclusion

Microservice architecture brings many benefits to a system, but also technical challenges: service registration and discovery, load balancing, monitoring and management, release and upgrade, access control, and so on. Service governance manages and prevents these problems to keep the system running continuously and smoothly.

The service governance approach described in this article is the traditional one: it sometimes intrudes into application code, and the choice of framework also constrains the programming language.

In the cloud-native era, the emergence of Service Mesh has brought the topic of service governance into a new stage. I will share more about this later.
