Rate limiting is never an easy task!


This article is reprinted from the WeChat public account "Coffee Latte" (author: Coffee Latte). Please contact the "Coffee Latte" public account for permission to reprint.

Background

With the popularity of microservices, stability between services matters more and more, and we often spend a lot of time keeping services stable. Rate limiting and circuit breaking with degradation are the two most commonly used means. Some time ago, friends in my group had questions about using rate limiting, and my company also did rate-limiting work for the recent big promotion, so I would like to summarize my views on rate limiting here.

As I just said, rate limiting is one of the means we use to keep services stable, but it does not guarantee stability in every scenario. As its name suggests, it only helps against large or bursty traffic. For example, suppose our system supports at most 100 QPS and a burst of 1000 QPS suddenly arrives. The system may go down outright and become unable to process any subsequent requests. With rate limiting in place, however large the traffic, we process only 100 QPS and reject the rest. We turn away 900 QPS, but the system stays up and keeps handling subsequent requests, which is exactly what we want.

Some students may say that since we are now on the cloud, dynamic scaling should be easy: if traffic is heavy, we can automatically add machines to support the target QPS, so rate limiting is unnecessary. Quite a few people hold this view, some perhaps misled by boastful articles. It can work in the ideal case, but in reality the following problems arise:

  • Scaling out takes time. Put simply, scaling out means acquiring a new machine and then deploying the code to it. Java developers know that a release is generally measured in minutes, not seconds; by the time the scale-out completes, the traffic peak may already have passed.
  • How much to scale out is a genuinely hard question. Deciding how many machines to add requires extensive stress testing and calculation, and the whole call chain must be scaled together: if you add machines on your side but other teams' machines are not expanded, a bottleneck may still remain, which is another problem.

Therefore, scaling out alone cannot solve the problem; rate limiting remains a skill we must master!

Basic principles

To master rate limiting, you first need to master its basic algorithms. There are essentially three: the counter, the leaky bucket, and the token bucket; the rest are evolutions of these.

Counter algorithm

First, let's look at the counter algorithm. It is simple and crude: we keep a single accumulating counter, reset it every second, and check whether it has exceeded our maximum QPS.

    private int curQps = 0;
    private long lastTime = System.currentTimeMillis();
    private final int maxQps = 100;
    private final Object lock = new Object();

    boolean check() {
        synchronized (lock) {
            long now = System.currentTimeMillis();
            if (now - lastTime > 1000) {
                lastTime = now;
                curQps = 0;
            }
            curQps++;
            if (curQps > maxQps) {
                return false;
            }
        }
        return true;
    }

The code is straightforward. We define the current QPS counter, the time of its last reset, the maximum QPS, and a lock. On every check we decide whether a reset is due; if so, we reset both the timestamp and the counter, and then increment and compare against the limit.

Because this algorithm is so simple, its problem is equally obvious. Suppose the maximum QPS is 100: 100 requests arrive at 0.99 s, and another 100 arrive at 1.01 s. Both waves pass our check, yet we have actually admitted 200 requests within 0.02 s. That is certainly not what we intended, and those 200 requests may well crash the machine outright.

Sliding Window Counter

To solve this boundary-spike problem, we can use a sliding window:

As shown in the figure above, we split the plain counter's 1 s window into five 200 ms buckets, and the current QPS is counted as the sum over the five most recent buckets. Returning to the earlier example: 0.99 s and 1.01 s fall within the same five recent buckets, so the combined 200 requests exceed the limit and the boundary spike described above can no longer slip through.

In fact, looked at another way, the plain counter is just a sliding-window counter with a single window. The more windows we divide the second into, the more accurate the counter-based statistics become, but the cost of maintaining the windows grows accordingly. When we introduce Sentinel later, we will explain in detail how it implements sliding-window counting.
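To make the idea concrete, here is a minimal sliding-window counter sketch in Java (my own illustration; Sentinel's real implementation, which we return to later, is more elaborate):

    // Minimal sliding-window counter: 5 buckets of 200 ms each cover 1 second.
    class SlidingWindowCounter {
        private static final int BUCKETS = 5;
        private static final long BUCKET_MS = 200;
        private final long[] counts = new long[BUCKETS];
        private final long[] bucketStarts = new long[BUCKETS];
        private final int maxQps;

        SlidingWindowCounter(int maxQps) {
            this.maxQps = maxQps;
        }

        synchronized boolean check() {
            long now = System.currentTimeMillis();
            int idx = (int) ((now / BUCKET_MS) % BUCKETS);
            long sliceStart = now - (now % BUCKET_MS);
            // Reuse the slot: reset it if it still holds an older time slice.
            if (bucketStarts[idx] != sliceStart) {
                bucketStarts[idx] = sliceStart;
                counts[idx] = 0;
            }
            // Sum every bucket that still falls inside the last second.
            long total = 0;
            for (int i = 0; i < BUCKETS; i++) {
                if (now - bucketStarts[i] < BUCKETS * BUCKET_MS) {
                    total += counts[i];
                }
            }
            if (total >= maxQps) {
                return false; // over the limit, reject
            }
            counts[idx]++;
            return true;
        }
    }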

Leaky bucket algorithm

The counter's boundary-spike problem can also be solved with the leaky bucket algorithm, as shown in the figure below:

In the leaky bucket algorithm, the things to note are the bucket and the uniform outflow. However large the inflow, traffic first enters the bucket and then flows out at a constant rate. How is this constant rate achieved in code? If we want 100 requests/s, then one request flows out every 10 ms, which is much like a queue: every 10 ms we take a request from the head of the queue and release it. The queue is the leaky bucket, and when the inflow exceeds the queue's capacity, the excess is rejected.
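Here is a minimal leaky-bucket sketch in Java along those lines (my own illustration; the 100 requests/s rate and the bucket size of 100 are assumptions):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Minimal leaky bucket: the queue is the bucket; a scheduler drains it
    // at a fixed rate of one request every 10 ms (100 requests/s).
    class LeakyBucket {
        private final BlockingQueue<Runnable> bucket = new ArrayBlockingQueue<>(100);
        private final ScheduledExecutorService drainer =
                Executors.newSingleThreadScheduledExecutor();

        LeakyBucket() {
            drainer.scheduleAtFixedRate(() -> {
                Runnable request = bucket.poll();
                if (request != null) {
                    request.run(); // release one request at the constant rate
                }
            }, 0, 10, TimeUnit.MILLISECONDS);
        }

        // Returns false (reject) when the bucket overflows.
        boolean submit(Runnable request) {
            return bucket.offer(request);
        }
    }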

The leaky bucket algorithm also has a clear disadvantage: it cannot absorb burst traffic (different from the boundary spike above; do not confuse the two). If 100 requests arrive at once, the leaky bucket can only release them one by one, and by the time the last one leaves, a full second has passed. The leaky bucket therefore suits scenarios where requests arrive evenly and the processing rate must be strictly controlled.

Token Bucket Algorithm

To handle burst traffic, we can use the token bucket algorithm, as shown in the figure below:

There are three stages to focus on in this diagram:

  • Producing tokens: still assuming a maximum of 100 QPS, instead of releasing one request every 10 ms as the leaky bucket does, we produce one token every 10 ms, until the bucket reaches its maximum size.
  • Consuming tokens: each request consumes tokens from the bucket. The consumption rules can vary: the simplest is one token per request, but different rules can be set for different request sizes or types, e.g. a query consumes 1 token while a write consumes 2.
  • Deciding whether to pass: if the bucket holds enough tokens, the request passes; if not, it can wait or be rejected outright, which can be controlled with a queue just as in the leaky bucket.
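Putting the three stages together, here is a minimal token-bucket sketch in Java (my own illustration, not Guava's or Sentinel's code; like Guava, it refills lazily on each call instead of using a background thread):

    // Minimal token bucket: tokens accrue at `rate` per second up to `capacity`,
    // refilled lazily on each call rather than by a background thread.
    class TokenBucket {
        private final double capacity; // maximum stored tokens
        private final double rate;     // tokens produced per second
        private double tokens;
        private long lastRefillNanos = System.nanoTime();

        TokenBucket(double rate, double capacity) {
            this.rate = rate;
            this.capacity = capacity;
            this.tokens = capacity;
        }

        // permits lets different traffic types consume different amounts,
        // e.g. 1 token for a query, 2 tokens for a write.
        synchronized boolean tryAcquire(int permits) {
            long now = System.nanoTime();
            tokens = Math.min(capacity, tokens + rate * (now - lastRefillNanos) / 1e9);
            lastRefillNanos = now;
            if (tokens < permits) {
                return false; // not enough tokens: reject (or queue/wait)
            }
            tokens -= permits;
            return true;
        }
    }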

Single-machine rate limiting

We have introduced the basic algorithms above. Applied to distributed services, rate limiting divides into two kinds: single-machine and cluster-wide. Single-machine rate limiting means each machine limits its own traffic, independently of the others. Let's first look at how to implement it.

Guava

Guava is Google's open-source Java core library, which includes collections, caching, concurrency, and other useful utilities. It also provides the rate-limiting tool we need here; the core class is RateLimiter.

    // RateLimiter rateLimiter = RateLimiter.create(100, 500, TimeUnit.MILLISECONDS); // warm-up RateLimiter
    RateLimiter rateLimiter = RateLimiter.create(100); // plain RateLimiter
    boolean limitResult = rateLimiter.tryAcquire();

Usage is simple: as shown above, we build a RateLimiter and call tryAcquire. A return value of true means the request passes; otherwise it is limited. Guava's RateLimiter comes in two flavors: a plain token-bucket implementation, and a warm-up RateLimiter that ramps the token issue rate up gradually until it reaches the maximum. Sentinel has the same warm-up concept. It is useful for cold systems, for example while a database connection pool is still being filled during initialization.

Here I will only briefly walk through how Guava implements its token bucket.

The plain token bucket is created as a SmoothBursty instance, which is the key to the limiter. The limiting itself happens in tryAcquire:

There are four steps here:

  • Step 1: take the synchronization lock. Note that Sentinel has no lock while Guava does; some of Sentinel's resulting issues are discussed later.
  • Step 2: check whether a token can be acquired at all. If the bucket lacks enough tokens and the wait would exceed our timeout, do not proceed.
  • Step 3: reserve the token and obtain the required wait time. The timeout parameter of tryAcquire is the maximum wait we accept; calling the no-argument tryAcquire() means no waiting at all, so step 2 simply fails fast.
  • Step 4: sleep for the wait time.
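Since the original screenshots are not reproduced here, the following is a simplified paraphrase of RateLimiter.tryAcquire from the Guava source; the method names follow Guava, but treat it as a sketch rather than the verbatim code:

    // Simplified paraphrase of Guava's RateLimiter.tryAcquire(permits, timeout, unit).
    public boolean tryAcquire(int permits, long timeout, TimeUnit unit) {
        long timeoutMicros = Math.max(unit.toMicros(timeout), 0);
        long microsToWait;
        synchronized (mutex()) {                       // step 1: one lock per limiter
            long nowMicros = stopwatch.readMicros();
            if (!canAcquire(nowMicros, timeoutMicros)) {
                return false;                          // step 2: wait would exceed timeout, fail fast
            }
            microsToWait = reserveAndGetWaitLength(permits, nowMicros); // step 3: book tokens
        }
        stopwatch.sleepMicrosUninterruptibly(microsToWait);             // step 4: sleep
        return true;
    }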

The token deduction itself lives in the reserveEarliestAvailable method:

Although the method looks like a lot of process, if we only call tryAcquire() there are just two steps to focus on (the two boxes highlighted in the original screenshot):

  • Step 1: issue tokens based on the elapsed time. Guava does not use a separate thread to issue tokens asynchronously; instead it tops up tokens lazily each time the limiting method is called. This design is worth learning from: very often we do not need an asynchronous thread to reach the goal, and we avoid the complexity one brings.
  • Step 2: deduct the tokens. canAcquire has already verified availability, so the deduction is guaranteed to succeed.
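Again paraphrasing the Guava source (SmoothRateLimiter), simplified down to these two steps; the warm-up variant additionally charges a wait time for tokens that are not yet stored:

    // Step 1: lazily issue tokens based on how much time has elapsed.
    void resync(long nowMicros) {
        if (nowMicros > nextFreeTicketMicros) {
            double newPermits = (nowMicros - nextFreeTicketMicros) / coolDownIntervalMicros();
            storedPermits = Math.min(maxPermits, storedPermits + newPermits);
            nextFreeTicketMicros = nowMicros;
        }
    }

    // Step 2: deduct tokens; canAcquire has already ensured this will succeed.
    long reserveEarliestAvailable(int requiredPermits, long nowMicros) {
        resync(nowMicros);
        long returnValue = nextFreeTicketMicros;
        double storedPermitsToSpend = Math.min(requiredPermits, storedPermits);
        storedPermits -= storedPermitsToSpend;
        return returnValue;
    }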

These are the two limiting modes Guava currently provides. Many middleware projects and business services use Guava's limiter as their tool, but it is fairly limited: it supports neither dynamic rule changes nor richer policy-based limiting. So next we introduce Sentinel.

Sentinel

Sentinel is a lightweight flow-control framework open-sourced by Alibaba; it has carried the core traffic of Alibaba's Double 11 promotions for the past 10 years. Its core is flow control, but it is not limited to that: it also supports circuit breaking, degradation, monitoring, and more.

Sentinel's rate limiting is slightly more involved than Guava's. Here is the simplest possible code:

    String KEY = "test";
    // =============== Initialize the rule ===============
    List<FlowRule> rules = new ArrayList<FlowRule>();
    FlowRule rule1 = new FlowRule();
    rule1.setResource(KEY);
    // set limit qps to 20
    rule1.setCount(20);
    rule1.setGrade(RuleConstant.FLOW_GRADE_QPS);
    rule1.setLimitApp("default");
    rule1.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_DEFAULT);
    rules.add(rule1);
    FlowRuleManager.loadRules(rules);
    // =============== Rate-limit check ===============
    Entry entry = null;
    try {
        entry = SphU.entry(KEY);
        // do something
    } catch (BlockException e1) {
        // a rate-limited request throws BlockException
    } finally {
        if (entry != null) {
            entry.exit();
        }
    }

  • Step 1: Sentinel is built around the concept of a Resource; whatever we protect or act on is keyed by a Resource, so we first decide its key. Here we simply use test.
  • Step 2: we initialize a flow rule for the Resource, choosing QPS-based limiting with the default behavior (the sliding-window counter), and load it into the global rule manager. This rule setup is quite different from Guava.
  • Step 3: the second key concept in Sentinel is the Entry. An Entry represents one operation on a resource; it stores the current invocation information and must be exited in the finally block. The rate-limit check happens when we obtain the Entry: SphU.entry is what executes the rules loaded above. Unlike Guava, a limited request throws BlockException, which is where we handle the rejected case.

Although Sentinel takes noticeably more ceremony to use than Guava, it also offers more algorithm choices.

Based on the number of concurrent threads

Everything introduced so far is QPS-based. Sentinel also provides a strategy based on the number of concurrent threads, with an effect similar to semaphore isolation. When we need to keep the business thread pool from being exhausted by slow calls, this is the mode to use.

Generally, all the HTTP endpoints of one service share a thread pool; with Tomcat as the web server, for example, that is the Tomcat worker pool. Suppose the service exposes two endpoints, A and B, where B is fast and A is slow. If A is called heavily, threads cannot be released because A is slow, the pool may be exhausted, and B can no longer get a thread. We encountered exactly this scenario, and it caused the whole service to reject every incoming request. Some students say limiting A's QPS is enough, but note that QPS is per second: if A takes longer than 1 s, the next wave of A requests is counted against a fresh second while the old ones still hold threads, so a QPS limit alone cannot bound the number of occupied threads.

For this, Sentinel offers limiting by concurrency: set the rule's Grade to FLOW_GRADE_THREAD to enable this mode.
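A minimal sketch of such a rule (the resource name and thread count are illustrative; as in the earlier example, imports from Sentinel's packages are omitted):

    FlowRule rule = new FlowRule("methodA");
    rule.setGrade(RuleConstant.FLOW_GRADE_THREAD); // limit by concurrent threads, not QPS
    rule.setCount(10);                             // at most 10 threads inside methodA at once
    FlowRuleManager.loadRules(Collections.singletonList(rule));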

Based on QPS

For QPS-based limiting, Sentinel provides four strategies:

  • Default: set Behavior to CONTROL_BEHAVIOR_DEFAULT, the sliding-window counter mode. It suits cases where the system's processing capacity is known exactly, e.g. established through stress testing.
  • Warm Up: set Behavior to CONTROL_BEHAVIOR_WARM_UP, similar to the warm-up in Guava. When a system that has long sat at a low water level suddenly receives heavy traffic, pulling it straight up to the high water level can crush it instantly; in this mode the permitted QPS instead ramps up gradually along a warm-up curve to the full threshold.

  • Uniform queuing: set Behavior to CONTROL_BEHAVIOR_RATE_LIMITER. This is the leaky bucket algorithm, whose pros and cons were explained earlier.
  • Warm Up + uniform queuing: set Behavior to CONTROL_BEHAVIOR_WARM_UP_RATE_LIMITER. Plain Warm Up switches to the sliding-window counter once the high water level is reached; this mode keeps using uniform queuing afterwards.
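Switching among these strategies is only a matter of rule configuration. For example, Warm Up plus uniform queuing might be set up like this (a sketch; the parameter values are illustrative):

    FlowRule rule = new FlowRule("test");
    rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    rule.setCount(100);
    rule.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP_RATE_LIMITER);
    rule.setWarmUpPeriodSec(60);    // ramp up to the full 100 QPS over 60 s
    rule.setMaxQueueingTimeMs(500); // queue a request for at most 500 ms before rejecting
    FlowRuleManager.loadRules(Collections.singletonList(rule));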

Based on call relationship

Sentinel also provides more sophisticated limiting based on the call relationship, which is more flexible:

  • Limiting by caller: origin-based limiting takes more setup. The caller must invoke ContextUtil.enter(resourceName, origin), where origin identifies the caller. The rule's limitApp parameter then selects which callers the rule applies to (see the sketch after this list):
    • default: the rule applies to all callers.
    • {some_origin_name}: the rule applies only to that specific caller.
    • other: the rule applies to every caller not already covered by another rule on the same resource.
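A minimal sketch of origin-based limiting (the resource and caller names are illustrative):

    // Caller side: declare who is calling when entering the context.
    ContextUtil.enter("createOrder", "cart-service"); // origin = caller id
    Entry entry = null;
    try {
        entry = SphU.entry("createOrder");
        // business logic
    } catch (BlockException e) {
        // this caller was rate-limited
    } finally {
        if (entry != null) {
            entry.exit();
        }
        ContextUtil.exit();
    }

    // Rule side: target that caller specifically.
    FlowRule rule = new FlowRule("createOrder");
    rule.setCount(50);
    rule.setLimitApp("cart-service"); // only limit traffic originating from cart-service
    FlowRuleManager.loadRules(Collections.singletonList(rule));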

Relation-based flow control: Sentinel also supports letting two related resources constrain each other. For example, if two interfaces use the same underlying resource and one is important while the other is not, we can set a rule so that when the important interface is being accessed heavily, the unimportant one is restricted, preventing its traffic from hurting the important interface.
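In rule terms this uses the relation strategy; a sketch (resource names illustrative):

    // Throttle reads on the order DB whenever writes (the important path) are heavy.
    FlowRule rule = new FlowRule("readOrderDb");
    rule.setCount(20);
    rule.setStrategy(RuleConstant.STRATEGY_RELATE); // relate this rule to another resource
    rule.setRefResource("writeOrderDb");            // trigger based on the write resource's traffic
    FlowRuleManager.loadRules(Collections.singletonList(rule));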

Some issues with Sentinel

For all the algorithms Sentinel provides, it also has some problems:

  • It is harder to get started with. Compared with Guava's two lines of code, Sentinel requires you to understand several concepts before using it. It does offer annotations to simplify usage, but overall it remains more complex than Guava.
  • It carries an operations cost. Using Sentinel usually means standing up the Sentinel dashboard server, whereas Guava works out of the box.
  • Its statistics have a concurrency gap. There is no lock in the Sentinel source, so in the extreme case, with a QPS limit of 10, if 100 threads reach the check at the same instant they may all pass. This cannot happen with Guava.

These problems all come from comparing against Guava's limiter; after all, Sentinel does far more, and that capability carries a correspondingly higher cost.

Cluster rate limiting

Everything discussed so far is single-machine limiting, but we now use microservice cluster architectures, where a service usually runs on multiple machines. Suppose an order service has 10 machines and we want to limit the whole cluster to 500 QPS. Simple, it seems: limit each machine to 50, since 50 * 10 = 500. In a real environment, however, load is not perfectly balanced. Service calls go through various load-balancing algorithms, such as same-data-center priority, round-robin, or random, which can skew the load so that the cluster as a whole is limited below 500 QPS, sometimes even around 400. We have hit exactly this in real scenarios. Since per-machine limiting has this problem, we need a more complete cluster-level design.

Redis

This approach does not rely on a limiting framework: the whole cluster shares one Redis, and we wrap the limiting logic ourselves. Using the simplest counter design, we take the current system time in seconds as the key (with an expiry set for space cleanup), use Redis's atomic integer increment to add 1 per request, and check whether the value exceeds our configured maximum.
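A minimal sketch using the Jedis client (the client library, key scheme, and expiry are my assumptions):

    import redis.clients.jedis.Jedis;

    // Second-granularity cluster counter: one Redis key per resource per second.
    public boolean allow(Jedis jedis, String resource, int maxQps) {
        long nowSec = System.currentTimeMillis() / 1000;
        String key = "limit:" + resource + ":" + nowSec; // assumed key scheme
        long count = jedis.incr(key);                    // atomic increment across the cluster
        if (count == 1) {
            jedis.expire(key, 2);                        // expire old keys to reclaim space
        }
        return count <= maxQps;
    }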

The Redis approach is generally simple to implement, but it depends heavily on the system clock: if clocks drift between machines, the limiting becomes inaccurate.

Sentinel

Sentinel provides a cluster solution, which is rather unique among limiting frameworks. It offers two modes:

  • Standalone mode: the limiting service is deployed as a separate server, as shown in the figure below: every application fetches tokens from the dedicated token-server. This mode suits global limiting across services, e.g. both A and B fetching tokens from the same token-server. The scenario is fairly rare; more often limiting happens within a single service's cluster.
  • Embedded mode: the token-server is deployed inside one of the application's own instances, and an instance can switch between server and client roles through an API. We can also introduce some ZooKeeper logic so the elected leader acts as the server, with automatic failover when it crashes. This mode suits limiting within one service cluster and is more flexible, but note that heavy token-server traffic can also load the hosting machine itself.

Sentinel also provides fallback strategies: if the token-server goes down, clients can fall back to single-machine limiting, so normal service is not affected.

Putting it into practice

We have introduced plenty of limiting tools, yet many students remain unsure how to actually apply them. When limiting a scenario or a resource, three questions need answering:

  • Where to limit?
  • How much traffic to allow?
  • Which tool to choose?

Where to limit?

This is a complicated question, and companies and teams approach it differently. When I was at Meituan, an SOA push required every service and every interface to be rate-limited, and each team was asked to evaluate a reasonable QPS ceiling for its interfaces. In theory that is correct: every interface should have a ceiling so the system as a whole cannot be dragged down. But the cost of doing so is very high, so most companies choose to limit selectively.

First, identify the core interfaces, such as ordering and payment in an e-commerce system, where excessive traffic would break critical paths. Edge interfaces such as reconciliation, which are off the core path, can be left without limits.

Second, we need not limit only at the interface layer. Often we limit directly at the gateway to stop traffic from penetrating further into the core system. The front end can also take part: when it catches the rate-limit error code, it can show a please-wait message, which is itself part of the limiting scheme. In fact, the further downstream a limit triggers, the more resources are wasted, because the upstream has already done substantial work for a request that is then dropped; if rollback work is involved, the burden grows further. So the earlier, i.e. the further upstream, a limit triggers, the better.

How much traffic to allow?

Most of the time, how much traffic to allow is an empirical, historical number. We can take the daily QPS monitoring chart and add some headroom on top of that value, and that becomes our limit. One scenario needs special attention, though: during a big promotion (an e-commerce example; any unusually high-traffic event is analogous), traffic surges far beyond the daily QPS. In that case we usually run full-link stress tests before the event to obtain a reliable ceiling, then set the limit from it.

Which tool to choose?

Generally, larger Internet companies have their own unified rate-limiting tooling, and you should simply use it. Otherwise, if you need neither cluster limiting nor circuit breaking, I think Guava's RateLimiter is a good choice: it is simple to use and has essentially no learning cost. If you have further needs, I would choose Sentinel. As for Hystrix, I do not recommend it personally, because it is no longer maintained.

Summary

Although "rate limiting" is only a couple of words, truly understanding it and doing it well is not easy. For me, this article offers only some surface-level insights; if you have better opinions, follow my official account and leave a message to discuss.
