Preface

Have you ever had this experience? A surge of traffic suddenly hits your online system. It might be a hacker attack, or business volume far larger than you expected. If the system takes no protective measures, its load climbs, resources are gradually exhausted, and interfaces respond more and more slowly until they become unavailable. Upstream systems that call your interfaces then exhaust their own resources, and eventually the whole chain crashes. Just think about it: the consequences are catastrophic. So what can be done? When facing this kind of sudden traffic, the core idea is to prioritize core business functions and the vast majority of users. There are four common countermeasures: downgrade, circuit breaking, rate limiting, and queuing. I will explain them one by one below.

1. Downgrade

Downgrading means the system reduces the functionality of certain services or interfaces: it may provide only part of a feature, or stop it entirely, so that core functions take priority. For example, during the midnight rush of Taobao's Double 11 shopping festival, you may find that the product-return function is unavailable. Similarly, a forum can be downgraded so that posts can be read but not created, or so that posts and comments can be read but no new comments posted. There are two common ways to implement downgrading:
System backdoor downgrade. In simple terms, the system reserves a "backdoor" for downgrade operations. For example, the system exposes a downgrade URL; accessing that URL is equivalent to executing a downgrade instruction, with the specific instruction passed in as URL parameters. This approach carries security risks, so safeguards such as a password are usually added to the URL. The system backdoor method is cheap to implement, but its main drawback is that when there are many servers, each must be operated on individually, which is inefficient and wastes time in troubleshooting scenarios where every second counts.
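The backdoor approach above can be sketched as a handler behind a downgrade URL that flips in-process feature flags. This is a minimal illustration, not a production design: the feature names, the `token` check, and the handler signature are all hypothetical, and a real system would use proper authentication rather than a static secret.

```python
# Hypothetical in-process feature-flag table; names are illustrative.
FEATURES = {"post_comment": True, "create_post": True, "read_post": True}

# Shared secret guarding the downgrade backdoor (assumption: a real
# system would use real authentication, not a static token in the URL).
SECRET = "change-me"

def handle_downgrade(feature: str, enable: bool, token: str) -> str:
    """Simulates handling GET /downgrade?feature=...&enable=...&token=...."""
    if token != SECRET:
        return "403 forbidden"
    if feature not in FEATURES:
        return "404 unknown feature"
    FEATURES[feature] = enable
    return f"200 {feature} -> {'enabled' if enable else 'disabled'}"

def post_comment(user: str, text: str) -> str:
    """A non-core function that checks its flag before doing real work."""
    if not FEATURES["post_comment"]:
        return "service degraded: commenting temporarily disabled"
    return "comment accepted"
```

Hitting the downgrade URL with `feature=post_comment&enable=false` then makes `post_comment` return the degraded message immediately, shedding non-core load while core functions keep running.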
Independent downgrade system. To overcome the shortcomings of the system backdoor method, the downgrade operation can be moved into a separate system that implements sophisticated permission management, batch operations, and other features. The basic architecture is as follows:

2. Circuit Breaker

Circuit breaking means stopping calls to an external interface according to certain rules, for example when 60% of requests within 1 minute return errors, to prevent a failure in an external dependency from sharply degrading or crippling the system's own processing capacity. Circuit breaking and downgrading are easily confused, because by name alone both seem to mean disabling some function. But their meanings differ: downgrading deals with failures of the system itself, while circuit breaking deals with failures of the external systems it depends on. For implementing circuit breaking, there are two mainstream solutions.
3. Rate Limiting

Every system has a service capacity limit, so when traffic exceeds that limit the system may stall or crash; this is why downgrading and flow control exist. Flow control means that, under high or instantaneous bursts of concurrency, the system sacrifices some requests or delays their processing in order to preserve overall stability and availability. Rate limiting is generally implemented inside the system, and common approaches fall into two categories: request-based limiting and resource-based limiting.
Request-based limiting approaches the problem from the perspective of external access requests, and there are two common forms. The first is limiting the total amount, that is, capping the cumulative value of some indicator, most commonly the total number of users served by the system. For example, a live-stream room may be capped at 1 million viewers, after which new users cannot enter; or a flash sale with only 100 items may cap the number of participants at 10,000 and reject everyone beyond that. The second is limiting by time, that is, capping an indicator within a time window: for example, allowing at most 10,000 users per minute, or at most 100,000 requests per second at peak. Both forms share the advantage of being easy to implement, but the main problem in practice is finding a suitable threshold. The system might be configured for 10,000 users per minute when it cannot actually handle even 6,000; or it might be under little pressure at 10,000 users per minute, yet requests are already being discarded. Even with an appropriate threshold, request-based limiting still faces hardware-related problems: a 32-core machine and a 64-core machine differ greatly in processing power, so their thresholds differ too. Some engineers assume the threshold can be derived by simple arithmetic on hardware specs, but in practice this does not work. A 64-core machine's business throughput is not twice that of a 32-core machine; it may be 1.5 times, or even just 1.1 times.
To find a reasonable threshold, performance stress testing is usually used. However, stress testing has limited scenario coverage: a code path not covered by the stress test may still put great pressure on the system. Another approach is step-by-step tuning: set an initial threshold, go live, observe how the system behaves, and adjust the threshold if it proves unreasonable. Based on this analysis, threshold-based limiting of access volume is better suited to systems with relatively simple business functions, such as load balancers, gateways, and flash-sale systems.
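The time-based form of request limiting described above (e.g. at most 10,000 users per minute) can be sketched as a fixed-window counter. This is a minimal illustration: the class and parameter names are hypothetical, and real systems often prefer sliding-window or token-bucket variants to avoid bursts at window boundaries.

```python
import time

class FixedWindowLimiter:
    """Sketch of time-based request limiting: allow at most `limit`
    requests per `window_s`-second window, rejecting the rest."""

    def __init__(self, limit, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.window_start = None
        self.count = 0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.window_start is None or now - self.window_start >= self.window_s:
            self.window_start = now     # start a fresh window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False                    # over the threshold: reject
```

Total-amount limiting is the degenerate case: a single counter with no window reset, rejecting once the lifetime cap (say, 10,000 flash-sale participants) is reached.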
Request-based limiting views the system from the outside; resource-based limiting views it from the inside, identifying the key resources that affect performance and capping their usage. Common internal resources include connection counts, file handles, thread counts, and request queues. For example, a server implemented with Netty might place each incoming request in a queue of maximum length 10,000, from which business threads then read requests for processing; once the queue is full, subsequent requests are rejected. Flow can also be limited by CPU load or utilization, for example rejecting new requests once CPU utilization exceeds 80%. Resource-based limiting reflects the system's actual pressure more faithfully than request-based limiting, but it faces two major design difficulties: identifying the key resources, and choosing thresholds for them. This, too, is usually an iterative tuning process: pick a key resource and threshold by reasoning, test and verify, go live and observe, and optimize again if the choice proves unreasonable.

4. Queuing

Everyone is familiar with queuing. When you buy train tickets on 12306, you are told to wait in a queue for a while before you can lock in a ticket and pay. At the end of the year, with so many people across China buying tickets, 12306 copes through its queuing mechanism, though at the cost of a worse user experience. Because queuing requires temporarily caching a large number of business requests, which a single system cannot hold, queuing generally needs to be implemented as an independent system, for example using a message queue such as Kafka to buffer user requests.
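Both the Netty-style bounded request queue from the rate-limiting discussion and the request buffering behind queuing rest on the same primitive: a bounded FIFO that rejects when full. A minimal sketch, assuming an in-process queue (the capacity of 3 is for illustration; the text's example uses 10,000):

```python
import queue

# Bounded request queue; capacity kept tiny here for illustration.
REQUEST_QUEUE: "queue.Queue[str]" = queue.Queue(maxsize=3)

def accept(request: str) -> bool:
    """Enqueue a request; reject immediately when the queue is full."""
    try:
        REQUEST_QUEUE.put_nowait(request)
        return True
    except queue.Full:
        return False                    # resource limit hit: shed load
```

Business threads drain `REQUEST_QUEUE` at their own pace; the queue length itself becomes the throttle, so the system's admission rate automatically tracks its processing rate.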
Queuing module: responsible for receiving users' flash-sale requests and storing them first-in, first-out. Each product in the flash-sale event gets its own queue, whose size can be set according to the number of items on sale (possibly with some extra margin).
Scheduling module: responsible for dynamic scheduling between the queuing module and the service module. It continuously polls the service module and, whenever processing capacity is free, transfers the request at the head of the queue to the service module for dispatch. The scheduler thus acts as an intermediary, but it does more than pass requests along: it also regulates the system's processing rate, dynamically adjusting how fast it pulls requests from the queuing system according to the service module's actual capacity.
Service module: responsible for calling the real business logic to process each request, returning the result, and calling the queuing module's interface to write the processing result back.

Summary

Finally, the four methods for keeping services highly available are summarized in the following table:

Method | Deals with failure of | Core idea
Downgrade | The system itself | Sacrifice non-core functions so core functions survive
Circuit breaker | External dependencies | Stop calling a failing external interface to protect own capacity
Rate limiting | The system itself | Discard or delay excess requests to stay within capacity
Queuing | The system itself | Buffer requests and process them at a sustainable rate
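As a closing illustration, the queue / scheduler / service pipeline described in section 4 can be sketched as follows. All class names, the per-tick capacity, and the result write-back shape are hypothetical; the point is only that the scheduler pulls from the FIFO no faster than the service module can absorb.

```python
import queue

class ServiceModule:
    """Hypothetical service module with a fixed per-tick capacity."""
    def __init__(self, capacity_per_tick=2):
        self.capacity_per_tick = capacity_per_tick
        self.results = {}               # request id -> outcome, written back

    def process(self, request_id):
        self.results[request_id] = "ok" # stand-in for real business logic

class Scheduler:
    """Scheduling module: pulls requests from the head of the queue only
    as fast as the service module can absorb them."""
    def __init__(self, q, service):
        self.q = q
        self.service = service

    def tick(self):
        """One scheduling round; returns how many requests were handled."""
        handled = 0
        while handled < self.service.capacity_per_tick:
            try:
                req = self.q.get_nowait()   # head of the FIFO queue
            except queue.Empty:
                break
            self.service.process(req)
            handled += 1
        return handled
```

Adjusting `capacity_per_tick` (or measuring it dynamically) is exactly the scheduler's rate-regulation responsibility described above.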