Double 11 Carnival, drink this bowl of "traffic control" soup


The Double 11 shopping festival is approaching. Its transaction volume was only 50 million yuan in 2009 but reached 268.4 billion yuan in 2019. This year marks the 12th Double 11, and I am excited just thinking about it.

People at Alibaba like to treat Double 11 as a team-building exercise. A widely circulated saying goes: fighting a battle together is the best team building. Those who have not been through Double 11 together are colleagues; those who have are comrades-in-arms.

In the previous article, I used a story from the Three Kingdoms to explain how service avalanches happen and how circuit breaking prevents them, and I built a circuit breaker of my own from scratch. This article covers two traffic control components used by first-tier companies, Sentinel and Hystrix, compares them side by side, and discusses how to choose between them.

This article has been included in my GitHub repository: https://github.com/Jackson0714/PassJava-Learning

1. Circuit Breaking, Degradation, Rate Limiting, and Isolation

Under highly concurrent traffic, we usually rely on four techniques (circuit breaking, degradation, rate limiting, and isolation) to protect the system from sudden traffic spikes. The two traffic guards introduced today are built for exactly this purpose. First, a brief introduction to each technique.

What is a circuit breaker?

Circuit breaker scenario diagram @Wukong Chat Architecture

Keyword: circuit breaker protection. Suppose service A calls service B, and the requests take too long because of network problems, service B being down, or slow processing inside service B. If this happens repeatedly within a certain period, A can cut B off directly (A stops sending requests to B). Calls to service B then return degraded data immediately instead of waiting for service B to respond, so the problems of service B do not cascade to service A.
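The behaviour described above can be sketched in a few lines. This is a minimal illustration, not the Hystrix or Sentinel implementation: the class name, the `failure_threshold` and `open_seconds` parameters, and the injectable `clock` are all assumptions made for the example, and it counts consecutive failures instead of failures within a time window to stay short.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after N consecutive failures (hypothetical names)."""

    def __init__(self, failure_threshold=3, open_seconds=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.open_seconds:
                # Open: do not touch service B at all, return degraded data.
                return fallback()
            # Otherwise half-open: let this one call through as a trial.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # (re)open the breaker
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Demo with a fake clock so nothing has to sleep.
now = [0.0]
calls_to_b = []


def flaky_service_b():
    calls_to_b.append(1)
    raise TimeoutError("service B timed out")


breaker = CircuitBreaker(failure_threshold=3, open_seconds=5.0, clock=lambda: now[0])
results = [breaker.call(flaky_service_b, lambda: "degraded") for _ in range(5)]

now[0] = 6.0  # the open period has passed; the next call is a trial
recovered = breaker.call(lambda: "fresh data", lambda: "degraded")
```

Service A always gets an answer: the first three calls fail and fall back, the next two never reach service B, and once B recovers the trial call closes the breaker again.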

What is degradation?

Degradation scenario diagram @Wukong Chat Architecture

Keyword: return degraded data. When the website hits a traffic peak, server pressure rises sharply. Based on the current business situation and traffic, some services and pages are strategically degraded: the service is stopped and every call returns degraded data directly. This relieves pressure on server resources, keeps the core business running normally, and ensures that most customers still receive a correct response. Degraded data can be as simple as quickly returning false, with the front-end page telling the user "The server is currently busy, please try again later."
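As a rough illustration of a degradation switch, the sketch below keeps a set of service names that operators have chosen to degrade and answers calls to them with canned data. The class, the service names `orders` and `recommendations`, and the shape of the response are hypothetical and only serve the example.

```python
DEGRADED_RESPONSE = {
    "ok": False,
    "message": "The server is currently busy, please try again later.",
}


class DegradeSwitch:
    """Manually controlled switch: degraded services return canned data."""

    def __init__(self):
        self._degraded = set()

    def degrade(self, service):
        self._degraded.add(service)

    def restore(self, service):
        self._degraded.discard(service)

    def call(self, service, fn):
        if service in self._degraded:
            # The real service is not invoked, so it consumes no resources.
            return DEGRADED_RESPONSE
        return fn()


switch = DegradeSwitch()
switch.degrade("recommendations")  # non-core feature, switched off at peak time

order = switch.call("orders", lambda: {"ok": True, "order_id": 1})
recs = switch.call("recommendations", lambda: {"ok": True, "items": []})
```

The core `orders` call runs normally, while the non-core `recommendations` call returns the quick "busy" answer until `restore` is called.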

What is rate limiting?

Rate limiting scenario diagram @Wukong Chat Architecture

Rate limiting controls the flow of requests: only some of them are let through, so that the service never has to bear more traffic than its capacity allows.
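One simple way to let only some requests through is a fixed-window counter: count requests per one-second window and reject the rest. The sketch below assumes that approach; the class name and the injectable `clock` are illustrative, and production limiters usually use sliding windows or token buckets instead.

```python
import time


class FixedWindowLimiter:
    """Admit at most `max_per_second` requests in each one-second window."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self._clock = clock
        self._window = None
        self._count = 0

    def try_acquire(self):
        window = int(self._clock())  # index of the current one-second window
        if window != self._window:
            # A new window has started: reset the counter.
            self._window = window
            self._count = 0
        if self._count < self.max_per_second:
            self._count += 1
            return True
        return False  # over the limit: reject this request


# Demo with a fake clock.
now = [0.0]
limiter = FixedWindowLimiter(max_per_second=3, clock=lambda: now[0])
first_second = [limiter.try_acquire() for _ in range(5)]

now[0] = 1.2  # move into the next window
next_second = limiter.try_acquire()
```

In the demo, three of the five requests in the first second pass and two are rejected; the request in the following second passes again.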

What do circuit breaking and degradation have in common?

  • Both aim to keep most services in the cluster available and reliable and to prevent core services from crashing.
  • In both cases, the end user experiences a certain function as unavailable.

What is the difference between circuit breaking and degradation?

  • Circuit breaking is triggered by a fault on the called side.
  • Degradation is a deliberate, global decision: certain normally working services are stopped in order to release resources.

What is isolation?

  • Each service is treated as an independently running system. Even if one of them has a problem, the other services are not affected.

2. Hystrix

What is Hystrix

Hystrix: a framework for ensuring high availability, produced by Netflix (a video streaming company, comparable to video sites such as iQiyi in China).

History of Hystrix

  • In 2011, Netflix's API team developed the Hystrix framework to improve the availability and stability of their system.
  • In 2012, Hystrix had become relatively mature and stable, and other teams began to use it as well.
  • In November 2018, Hystrix announced on its GitHub homepage that it would no longer add new features and recommended that developers move to other open source projects that are still active. Even so, Hystrix remains valuable and powerful, and many first-tier Internet companies in China still use it.

Hystrix Design Philosophy

  • Stop service avalanches from spreading.
  • Fail fast and recover fast.
  • Graceful degradation.
  • Use resource isolation techniques such as bulkhead, swimlane, and circuit breaker.
  • Near real-time monitoring, alarming, and maintenance operations.

Hystrix thread pool isolation technology

Take thread pool isolation with three services A, B, and C, whose thread pools are allocated 10, 20, and 30 threads respectively. When all 10 threads in service A's pool are in use, further calls to service A cannot get a thread: the pool allocated to A is exhausted, and A never borrows threads from the other pools, so the other services are unaffected. Hystrix uses thread pool isolation by default.

Thread pool isolation technology
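To make the idea concrete, here is a small sketch of one bounded pool per dependency using only the Python standard library. It is not how Hystrix is implemented; `IsolatedPool`, the pool sizes, and the choice to reject (instead of queue) when the pool is full are assumptions for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class IsolatedPool:
    """One bounded thread pool per dependency; a full pool rejects new calls."""

    def __init__(self, name, size):
        self.name = name
        self._executor = ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
        self._slots = threading.BoundedSemaphore(size)

    def submit(self, fn, *args):
        # Refuse immediately when every thread of this pool is busy.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"{self.name}: thread pool exhausted")
        try:
            future = self._executor.submit(fn, *args)
        except BaseException:
            self._slots.release()
            raise
        # Give the slot back once the call has finished.
        future.add_done_callback(lambda _: self._slots.release())
        return future


gate = threading.Event()  # keeps the calls to service A "stuck" for the demo
pool_a = IsolatedPool("service-a", 2)
pool_b = IsolatedPool("service-b", 2)

stuck = [pool_a.submit(gate.wait) for _ in range(2)]  # A's pool is now full
try:
    pool_a.submit(gate.wait)
    rejected = False
except RuntimeError:
    rejected = True

# Service B has its own threads and is not affected by A being saturated.
result_b = pool_b.submit(lambda: "ok").result(timeout=1)
gate.set()  # let the stuck calls finish
```

Even though every thread reserved for service A is blocked, the call to service B completes normally, which is the point of the bulkhead.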

Advantages of thread pool isolation technology

  • Each dependent service has its own isolated thread pool. Even when that pool is full, calls to other services are not affected.
  • The health of each thread pool is reported, and the call configuration of a dependent service can be changed in near real time.
  • Thread pools are asynchronous by nature, so an asynchronous call layer can be built on top of them.
  • There is a timeout detection mechanism, which is especially useful for calls between services.

Disadvantages of thread pool isolation technology

  • Thread pools bring their own costs, such as thread switching and thread management, which add CPU overhead.
  • If the threads in a pool are mostly idle, the resources reserved for them are wasted.

Hystrix semaphore isolation technology

As shown in the figure below, a pool holds a fixed number of semaphore permits. Each time service A wants to call service B, it must first obtain a permit from the pool, and only then can it call service B.

Get semaphore
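The permit mechanism can be sketched with a standard-library semaphore. This is an illustration under assumptions, not Hystrix code: `SemaphoreIsolation` is a made-up name, and returning a fallback when no permit is free is one possible policy. Note that the call runs on the caller's own thread, which is why this style cannot time out a slow call by itself.

```python
import threading


class SemaphoreIsolation:
    """Cap the number of concurrent calls to a dependency with permits."""

    def __init__(self, permits):
        self._permits = threading.BoundedSemaphore(permits)

    def call(self, fn, fallback):
        # Non-blocking acquire: with no permit left, fall back at once.
        if not self._permits.acquire(blocking=False):
            return fallback()
        try:
            return fn()  # runs on the caller's thread, no thread switch
        finally:
            self._permits.release()


guard = SemaphoreIsolation(permits=1)


def call_service_b():
    return "B ok"


def call_while_permit_is_held():
    # The outer call holds the only permit, so this inner call is rejected.
    return guard.call(call_service_b, lambda: "fallback")


nested = guard.call(call_while_permit_is_held, lambda: "fallback")
after = guard.call(call_service_b, lambda: "fallback")
```

The nested call simulates a second caller arriving while the only permit is taken; once the permit is released, calls go through again.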

Comparison of thread pool isolation and semaphore scenarios

  • Thread pool isolation suits most scenarios, but it requires a service timeout to be configured.
  • Semaphore isolation suits calls into internal business logic, even complex logic, that involve no network requests.

3. Sentinel

3.1 What is Sentinel

Sentinel: a traffic control component for distributed service architectures. It takes traffic as its entry point and helps developers keep microservices stable along multiple dimensions, such as rate limiting, traffic shaping, circuit breaking and degradation, system load protection, and hotspot protection.

3.2 History of Sentinel

  • In 2012, Sentinel was born, with the main function of ingress traffic control.
  • From 2013 to 2017, Sentinel developed rapidly within the Alibaba Group and became a basic technology module, covering all core scenarios. As a result, Sentinel has accumulated a large number of traffic aggregation scenarios and production practices.
  • In 2018, Sentinel was open sourced and continues to evolve.
  • In 2019, Sentinel continued to explore multi-language support and launched a native C++ version. It also added Envoy cluster flow control for Service Mesh scenarios, to solve the problem of multi-language rate limiting under a Service Mesh architecture.
  • In 2020, Sentinel Go version was launched, continuing its evolution towards cloud native.

3.3 Features of Sentinel

  • Rich application scenarios. Sentinel supports the core scenarios of Alibaba's Double 11, such as flash sales, message peak shaving and valley filling, cluster flow control, and real-time circuit breaking of unavailable downstream services.
  • Complete real-time monitoring. You can see per-second data for each machine connected to an application, as well as an aggregated view of the cluster.
  • Extensive open source ecosystem. Spring Cloud, Dubbo, and gRPC can all be connected to Sentinel.
  • Complete SPI extension points. Implement extension interfaces to quickly customize logic.

To summarize with a picture:

Key Features of Sentinel

3.4 Sentinel Composition

  • The core library (Java client) does not depend on any framework/library, can run in all Java runtime environments, and has good support for frameworks such as Spring Cloud and Dubbo.
  • The Dashboard is developed based on Spring Boot and can be run directly after packaging without the need for additional application containers such as Tomcat.

3.5 Sentinel Resources

A resource is a core concept in Sentinel. It can be anything in a Java application: a service the application provides, or even a single piece of code.

Any code defined through the Sentinel API is a resource and can be protected by Sentinel. A resource can be identified by, for example:

  • Method signature.
  • URL.
  • Service name, etc.

3.6 Sentinel’s Design Concept

As a traffic controller, Sentinel can reshape randomly arriving requests into a suitable form as needed, as shown in the following figure:

Traffic Shaping

4. Comparison

4.1 Comparison of Isolation Design

Hystrix

Hystrix provides two isolation strategies, thread pool isolation and semaphore isolation.

Thread pool isolation is the most recommended and most commonly used strategy in Hystrix. Its advantage is a high degree of isolation: one resource cannot affect the others. But threads bring costs of their own. Thread context switching consumes CPU, which matters a great deal when low latency is required. Creating threads also requires memory, and the more threads are created, the more memory must be allocated. If a thread pool is created for every resource, the cost of thread switching grows further.

Hystrix's semaphore isolation limits the number of concurrent calls to a resource. It is lightweight and needs no explicitly created thread pool. Its drawback is that it cannot automatically degrade slow calls; it can only wait for the client to time out, so cascading blocking may still occur.

Sentinel

Sentinel provides semaphore-style isolation through flow control in concurrent-thread-count mode. It also offers response-time-based circuit breaking and degradation, which prevents too many slow calls from using up the concurrency quota and affecting the whole system.

4.2 Comparison of Circuit Breaking and Degradation

Sentinel and Hystrix are both based on the circuit breaker pattern, and both support circuit breaking based on the exception ratio. Sentinel is more powerful, though: it can break and degrade based on response time, exception ratio, or number of exceptions.
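The three triggers named above differ only in which statistic is compared with the threshold. The function below is a simplified sketch of that decision, not Sentinel's rule engine: `WindowStats`, the strategy strings, and the `min_requests` guard are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    total: int          # requests seen in the statistics window
    exceptions: int     # requests that raised an exception
    avg_rt_ms: float    # average response time in milliseconds


def should_open(stats, strategy, threshold, min_requests=5):
    """Decide whether the breaker should open for the given strategy."""
    if stats.total < min_requests:
        return False  # too little traffic to judge (assumed guard)
    if strategy == "avg_rt":
        return stats.avg_rt_ms > threshold
    if strategy == "exception_ratio":
        return stats.exceptions / stats.total > threshold
    if strategy == "exception_count":
        return stats.exceptions >= threshold
    raise ValueError(f"unknown strategy: {strategy}")


stats = WindowStats(total=20, exceptions=6, avg_rt_ms=350.0)
by_rt = should_open(stats, "avg_rt", threshold=200)              # slow calls
by_ratio = should_open(stats, "exception_ratio", threshold=0.5)  # 30% < 50%
by_count = should_open(stats, "exception_count", threshold=5)    # 6 >= 5
```

The same window of traffic can trip one rule and not another, which is why the choice of strategy and threshold depends on how the downstream service tends to fail.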

4.3 Comparison of real-time statistics

Sentinel and Hystrix both compute real-time statistics over sliding windows. Hystrix uses an event-driven model built on RxJava: it publishes an event whenever a service call succeeds, fails, or times out, and then derives a stream of real-time metrics through a series of transformations and aggregations, which circuit breakers and dashboards can consume. Sentinel implements its sliding window with LeapArray.
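A sliding window of this kind is typically a small ring of time buckets that are reset lazily when they are reused. The sketch below illustrates that general idea only; it is not the LeapArray source, it is not thread-safe, and the bucket size, bucket count, and `clock_ms` parameter are assumptions for the example.

```python
import time


class SlidingWindowCounter:
    """Ring of time buckets; a stale bucket is reset when its slot is reused."""

    def __init__(self, bucket_ms=500, bucket_count=2, clock_ms=None):
        self.bucket_ms = bucket_ms
        self.bucket_count = bucket_count
        self._clock_ms = clock_ms or (lambda: int(time.monotonic() * 1000))
        self._starts = [None] * bucket_count  # start time of each bucket
        self._counts = [0] * bucket_count

    def _current_index(self, now):
        index = (now // self.bucket_ms) % self.bucket_count
        start = now - now % self.bucket_ms
        if self._starts[index] != start:
            # The slot still holds an old bucket: recycle it.
            self._starts[index] = start
            self._counts[index] = 0
        return index

    def add(self, n=1):
        now = self._clock_ms()
        self._counts[self._current_index(now)] += n

    def total(self):
        now = self._clock_ms()
        window_ms = self.bucket_ms * self.bucket_count
        return sum(
            count
            for start, count in zip(self._starts, self._counts)
            if start is not None and now - start < window_ms
        )


# Demo with a fake millisecond clock.
now = [0]
window = SlidingWindowCounter(bucket_ms=500, bucket_count=2, clock_ms=lambda: now[0])
window.add(3)            # t = 0 ms, first bucket
now[0] = 600
window.add(2)            # t = 600 ms, second bucket
total_at_600 = window.total()
now[0] = 1100            # the first bucket has slid out of the window
total_at_1100 = window.total()
```

Because old buckets are simply skipped or overwritten, reading the total never requires scanning individual requests.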

5. Outstanding Features of Sentinel

In addition to the three major comparisons mentioned above, Sentinel has some features that Hystrix does not have.

5.1 Flow Control

Flow control works by monitoring metrics of application traffic, such as QPS or the number of concurrent threads, and restricting traffic once a specified threshold is reached. This keeps the application from being overwhelmed by sudden traffic peaks and so preserves its availability.

Sentinel can perform flow control based on QPS/concurrency or based on call relationships.

There are several ways to control traffic based on QPS:

  • Direct rejection: once QPS exceeds the threshold, further requests are rejected immediately. This suits cases where the system's processing capacity is known.
  • Slow-start warm-up: if the system has been at a low water level for a long time and traffic suddenly surges, pushing it straight to a high water level may overwhelm it at once. With a "cold start", the permitted traffic grows slowly and reaches the upper threshold over a set period, giving the cold system time to warm up.

Schematic diagram of slow-start warm-up mode

  • Uniform queuing: Requests pass through at a uniform speed, which corresponds to the leaky bucket algorithm.

Schematic diagram of uniform queuing mode
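The two shaping modes above, warm-up and uniform queuing, can be approximated with very little code. Both pieces below are simplified sketches under assumptions: the linear ramp and the `cold_factor` of 3 are illustrative choices (real implementations are usually token-bucket based), and `UniformQueue` takes arrival times as arguments instead of sleeping so that the example stays deterministic.

```python
def warm_up_threshold(elapsed_s, max_qps, warm_up_s, cold_factor=3):
    """QPS allowed `elapsed_s` seconds after traffic reaches a cold system."""
    cold_qps = max_qps / cold_factor
    progress = min(max(elapsed_s / warm_up_s, 0.0), 1.0)
    # Linear ramp from the cold threshold up to the full threshold.
    return cold_qps + (max_qps - cold_qps) * progress


class UniformQueue:
    """Leaky-bucket style pacing: one request every 1/qps seconds."""

    def __init__(self, qps, max_wait_s):
        self.interval_s = 1.0 / qps
        self.max_wait_s = max_wait_s
        self._next_free = 0.0  # earliest time the next request may pass

    def schedule(self, arrival_s):
        """Return the wait in seconds, or None if the request is rejected."""
        start = max(arrival_s, self._next_free)
        wait = start - arrival_s
        if wait > self.max_wait_s:
            return None  # it would queue for too long
        self._next_free = start + self.interval_s
        return wait


ramp = [warm_up_threshold(t, max_qps=300, warm_up_s=10) for t in (0, 5, 20)]

queue = UniformQueue(qps=10, max_wait_s=0.25)
waits = [queue.schedule(arrival_s=0.0) for _ in range(5)]  # burst of 5 at t = 0
```

In the burst, the first requests are spaced one interval apart and the ones that would wait longer than `max_wait_s` are rejected, while the warm-up threshold climbs from one third of the limit to the full limit.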

Flow control based on call relationship:

  • Limit by caller.
  • Limit by call chain entry: chain flow limiting.
  • Limit by the traffic of a related resource: associated flow limiting.
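As an illustration of the first item, limiting by caller, the sketch below keeps one counter per caller for a single statistics window. The class name, the caller names `app-a` and `app-b`, and the explicit `reset()` that starts a new window are hypothetical and only serve the example.

```python
from collections import Counter


class CallerLimiter:
    """Per-caller request cap within one window; reset() starts a new window."""

    def __init__(self, limits, default_limit):
        self.limits = dict(limits)          # caller name -> max requests
        self.default_limit = default_limit  # used for callers without a rule
        self._counts = Counter()

    def try_acquire(self, caller):
        limit = self.limits.get(caller, self.default_limit)
        if self._counts[caller] >= limit:
            return False  # this caller has used up its own quota
        self._counts[caller] += 1
        return True

    def reset(self):
        self._counts.clear()


limiter = CallerLimiter(limits={"app-a": 2}, default_limit=1)
from_a = [limiter.try_acquire("app-a") for _ in range(3)]
from_b = [limiter.try_acquire("app-b") for _ in range(2)]
```

Each caller is throttled against its own quota, so a noisy caller cannot consume the capacity intended for the others.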

5.2 System Adaptive Rate Limiting

Sentinel's system adaptive rate limiting controls an application's inbound traffic as a whole. Borrowing ideas from TCP BBR, it combines monitoring metrics from several dimensions (application load, CPU utilization, overall average RT, inbound QPS, and the number of concurrent threads) and uses an adaptive flow control strategy to balance inbound traffic against system load, so that the system runs as close to its maximum throughput as possible while staying stable overall.

Adaptive rate limiting principle diagram

Imagine the way the system processes requests as a water pipe, with incoming requests as the water filling it. When the system is processing smoothly, requests do not need to queue and pass straight through the pipe, and their RT is the shortest possible. When requests pile up, the time to process a request becomes queuing time + shortest processing time.

Corollary 1: If we can keep the amount of water in the pipe at a level where it still flows smoothly, the number of queued requests will not grow; that is, the system load will not deteriorate further.

Corollary 2: When the inbound flow rate is held at the maximum rate at which water can leave the pipe, the pipe's processing capacity is used to the fullest.
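The two corollaries amount to a small piece of arithmetic: the amount of "water" the pipe can hold without queuing is roughly the maximum throughput multiplied by the shortest RT. The sketch below works through that estimate with hypothetical numbers (200 requests per second, 50 ms minimum RT); the function names are illustrative and this is not Sentinel's code.

```python
def max_in_flight(max_pass_qps, min_rt_ms):
    """Requests the 'pipe' can hold at once without any queuing."""
    return max_pass_qps * min_rt_ms / 1000.0


def should_reject(current_in_flight, max_pass_qps, min_rt_ms):
    """Reject new requests once more are in flight than the pipe can hold."""
    return current_in_flight > max_in_flight(max_pass_qps, min_rt_ms)


# Hypothetical measurements: best observed throughput and shortest RT.
capacity = max_in_flight(max_pass_qps=200, min_rt_ms=50)
below = should_reject(current_in_flight=8, max_pass_qps=200, min_rt_ms=50)
above = should_reject(current_in_flight=12, max_pass_qps=200, min_rt_ms=50)
```

With these numbers the pipe holds about 10 requests at a time: 8 in flight is still fine, while 12 means requests have started to queue and new ones should be turned away.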

5.3 Real-time monitoring and control panel

Sentinel provides a lightweight open source console that offers machine discovery and health management, monitoring (single machine and cluster), and rule management and push.

Sentinel Console

5.4 Development and Ecosystem

Sentinel has adapters for Spring Cloud, Dubbo, and gRPC. Adding a dependency and a little configuration is enough to start using it. I believe Sentinel will be a powerful tool for traffic control in the future, and I am optimistic about it.

5.5 Comparison and summary of Sentinel and Hystrix

Comparison between Hystrix and Sentinel

Final Thoughts

Some readers have asked me how to design a flash sale system. In previous articles I walked through the architecture of a flash sale system; here is a summary of its eight key points:

  • Single-responsibility services, deployed independently
  • Inventory preheating and fast deduction
  • Flash sale link encryption
  • Separation of static and dynamic content
  • Malicious request interception
  • Traffic peak shifting
  • Rate limiting, circuit breaking, and degradation
  • Queue-based peak shaving

This article is reprinted from the WeChat public account "Wukong Chats about Architecture". To reprint this article, please contact the WeChat public account "Wukong Chats about Architecture".
