20 billion daily traffic, Ctrip gateway architecture design
Author: Butters, a software technology expert at Ctrip, specializing in network architecture, API gateways, load balancing, Service Mesh, and related fields.
1. Overview
Like many companies, Ctrip introduced its API gateway as infrastructure alongside the microservice architecture; the first version was released in 2014. As service orientation advanced rapidly within the company, the gateway gradually became the standard way to expose applications to the external network. Through subsequent projects such as "ALL IN Wireless", internationalization, and multi-site active-active, the gateway kept evolving together with the company's public business and infrastructure. As of July 2021, more than 3,000 services were connected, and the average daily traffic processed reached 20 billion.
In terms of technical solutions, the company's early microservice development was deeply influenced by NetflixOSS, and the gateway was first developed with reference to Zuul 1.0. Its core can be summarized in the following four points:
picture
As is well known, synchronous calls block threads, so system throughput is heavily affected by IO. As an industry leader, Zuul took this into account in its design: introducing Hystrix provides resource isolation and rate limiting, which confines failures (slow IO) to a limited scope; combined with a circuit-breaker strategy, some thread resources can be released early; the end goal is that local anomalies do not affect the system as a whole. However, as the company's business kept growing, the effectiveness of this strategy gradually weakened, mainly for two reasons:
picture
Full asynchronous transformation has been a core task for Ctrip's API gateway in recent years. This article centers on that work and shares our practical experience with the gateway, covering performance optimization, the business model, technical architecture, and governance experience.
2. High-performance gateway core design
2.1. Asynchronous process design
Full asynchrony = server-side asynchrony + business process asynchrony + client-side asynchrony
For the server and client, we use the Netty framework, whose NIO/Epoll + Eventloop model is essentially event-driven. The core of our transformation is making the business process asynchronous. Common asynchronous scenarios include:
In our experience, asynchronous code is somewhat harder to design, read, and write than synchronous code. The main difficulties include:
Especially in the Netty context, a poorly designed ByteBuf lifecycle easily causes memory leaks. To address these issues, we built supporting frameworks that smooth out the synchronous/asynchronous differences in business code as much as possible, which eases development; they also provide default protection and fault tolerance to keep the program safe overall. For tooling we use RxJava; the main process is shown in the figure below.
picture
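The idea of an asynchronous business step with default protection can be sketched with plain JDK classes. The gateway itself uses RxJava; the sketch below swaps in java.util.concurrent.CompletableFuture (Java 9+ for orTimeout) only to stay self-contained, and every name in it (AsyncPipelineSketch, lookupRoute, stuckLookup, handle) is hypothetical rather than taken from the gateway. It wraps one non-blocking step with a timeout and a fallback result, so a slow or failing step cannot stall the request.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Ctrip gateway.
final class AsyncPipelineSketch {

    // Stands in for a non-blocking business step such as a routing lookup.
    static CompletableFuture<String> lookupRoute(String path) {
        return CompletableFuture.supplyAsync(() -> "backend-for:" + path);
    }

    // A step that never completes, used to exercise the timeout protection.
    static CompletableFuture<String> stuckLookup(String path) {
        return new CompletableFuture<>();
    }

    // Runs one step with default protection: a timeout plus a fallback value.
    static CompletableFuture<String> handle(
            String path,
            Function<String, CompletableFuture<String>> step,
            long timeoutMillis) {
        return step.apply(path)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS) // fail the step if it is too slow
                .thenApply(route -> "200 " + route)              // continue without blocking a thread
                .exceptionally(err -> "504 fallback");           // fault tolerance for timeout or error
    }

    public static void main(String[] args) {
        // prints "200 backend-for:/hotel"
        System.out.println(handle("/hotel", AsyncPipelineSketch::lookupRoute, 1000).join());
        // prints "504 fallback" after roughly 50 ms
        System.out.println(handle("/hotel", AsyncPipelineSketch::stuckLookup, 50).join());
    }
}
```

The join() calls appear only in the demo; inside an event loop the result would be consumed by a further callback instead of blocking.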
2.2. Streaming forwarding & single thread
Taking HTTP as an example, a message can be divided into three parts: initial line, header, and body.
picture
At Ctrip, gateway-layer business logic does not involve the request body. Because the full message does not need to be stored, the business process can start as soon as the request header has been parsed. If part of the request body arrives in the meantime: ① if the request has already been forwarded to upstream, forward the body directly; ② otherwise, buffer it temporarily and send it together with the initial line/header once the business process completes; ③ upstream responses are handled the same way. Compared with fully parsing HTTP messages, processing them this way:
Although it improves performance, stream processing also greatly increases the complexity of the whole flow.
picture
In non-streaming scenarios, Netty server encoding/decoding, inbound business logic, Netty client encoding/decoding, and outbound business logic are independent sub-processes, each handling complete HTTP objects. With streaming, however, a request may be in several of these stages at once, which brings the following three challenges:
To address these challenges, we adopted a single-threaded approach. The core design includes:
The single-threaded approach avoids concurrency issues. When handling multi-stage linkage and edge cases, the whole system is in a deterministic state, which effectively reduces development difficulty and risk. Reducing thread switching also improves performance to some extent. However, because there are few worker threads (generally equal to the number of CPU cores), blocking IO operations must be avoided completely in the eventloop; otherwise system throughput suffers significantly.
2.3 Other Optimizations
For request fields such as cookie and query, do not parse strings ahead of time unless necessary.
Combined with the previous streaming forwarding design, the system memory usage can be further reduced.
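As an illustration of deferring string parsing, the following minimal sketch keeps the raw query string and parses it only when a parameter is first requested. LazyQuery and its methods are hypothetical names, not the gateway's code; URL decoding and repeated keys are deliberately ignored, and the class assumes it is only touched from one thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: holds the raw query string and parses it on first access only.
final class LazyQuery {
    private final String raw;           // e.g. "uid=42&lang=en", stored as received
    private Map<String, String> parsed; // stays null until a caller needs a parameter
    private int parseCount;             // demo only: how many times parsing actually ran

    LazyQuery(String raw) {
        this.raw = raw;
    }

    String get(String name) {
        if (parsed == null) {           // parse lazily, at most once
            parsed = parse(raw);
            parseCount++;
        }
        return parsed.get(name);
    }

    int parseCount() {
        return parseCount;
    }

    private static Map<String, String> parse(String raw) {
        Map<String, String> out = new LinkedHashMap<>();
        if (raw.isEmpty()) {
            return out;
        }
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                out.put(pair, "");      // key without a value
            } else {
                out.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        LazyQuery q = new LazyQuery("uid=42&lang=en");
        System.out.println(q.parseCount()); // prints 0: nothing parsed yet
        System.out.println(q.get("uid"));   // prints 42
        q.get("lang");
        System.out.println(q.parseCount()); // prints 1: parsed once, then reused
    }
}
```

A request that is routed purely on its path never pays the parsing cost in this arrangement.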
Because the project upgraded to TLSv1.3, we introduced JDK 11 (JDK 8 gained support later, in version 8u261, released 2020-07-14). We also tried the new generation of garbage collection algorithms, and their actual performance met expectations: although CPU usage increased, overall GC time dropped significantly.
picture
picture
Due to the long history and openness of the HTTP protocol, many "bad practices" have emerged, which may affect the success rate of requests or even pose a threat to website security.
For problems such as an oversized request body (413), an overly long URI (414), or non-ASCII characters (400), typical web servers reject the request outright and return the corresponding status code. Because such requests skip the business process, they complicate statistics, service identification, and troubleshooting. By extending the codec, problematic requests can still go through the routing process, which helps manage non-standard traffic.
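One way to picture this is a check that records violations instead of failing at decode time, so routing still runs and the request can be attributed to a service. The sketch below is an assumption-laden illustration, not the gateway's codec: the class TolerantRequestCheck, the limits, the first-path-segment routing rule, and the choice to reject after routing are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect protocol violations rather than rejecting during decoding.
final class TolerantRequestCheck {
    static final int MAX_URI_LENGTH = 4096;      // assumed limit
    static final long MAX_BODY_BYTES = 1L << 20; // assumed limit (1 MiB)

    // Returns the status codes the request would deserve; an empty list means well-formed.
    static List<Integer> violations(String uri, long contentLength) {
        List<Integer> codes = new ArrayList<>();
        if (contentLength > MAX_BODY_BYTES) {
            codes.add(413);
        }
        if (uri.length() > MAX_URI_LENGTH) {
            codes.add(414);
        }
        // Anything outside printable ASCII counts as a bad request here.
        if (!uri.chars().allMatch(c -> c > 0x20 && c < 0x7f)) {
            codes.add(400);
        }
        return codes;
    }

    // Assumed routing rule: the first path segment names the target service.
    static String route(String uri) {
        String[] parts = uri.split("/");
        return parts.length > 1 && !parts[1].isEmpty() ? parts[1] : "unknown";
    }

    // Routing runs either way, so logs and metrics know which service was targeted.
    static String process(String uri, long contentLength) {
        List<Integer> codes = violations(uri, contentLength);
        String service = route(uri);
        if (codes.isEmpty()) {
            return "forward to " + service;
        }
        return "reject " + codes.get(0) + " (attributed to " + service + ")";
    }

    public static void main(String[] args) {
        // prints "forward to hotel"
        System.out.println(process("/hotel/search", 10));
        // prints "reject 400 (attributed to hotel)"
        System.out.println(process("/hotel/s?q=\u5317\u4eac", 0));
    }
}
```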
One example is request smuggling (fixed in Netty 4.1.61.Final, released on March 30, 2021). By extending the codec and adding custom validation logic, security patches can be applied faster.
3. Gateway Service Model
As an independent, unified entry point for inbound traffic, the gateway's value to the enterprise is mainly reflected in the following three aspects:
picture picture
Here are a few detailed scenarios:
In the closed client (APP), the framework layer intercepts HTTP requests initiated by the user and transmits them to the server over a private protocol (SOTP). For site selection: ① the server allocates IPs to prevent DNS hijacking; ② connections are pre-warmed; ③ a customized site selection strategy switches automatically based on network conditions, environment, and other factors. For the interaction mode: ① a lighter protocol body is used; ② encryption, compression, and multiplexing are handled uniformly; ③ the gateway converts the protocol at the entry point, so the business is unaffected.
The key is introducing an access layer so that remote users connect to a nearby access point, which solves the problem of excessive handshake overhead. Since both the access layer and the IDC are under our control, there is also more room to optimize network link selection, protocol interaction modes, and so on.
Unlike proportional allocation and nearest-access strategies, in multi-site active-active mode the gateway (access layer) must divert traffic based on a business-dimension shardingKey (such as userId) to prevent conflicts in the underlying data.
picture
4. Gateway Governance
The following diagram summarizes how the online gateways operate. The vertical direction corresponds to our business process: traffic from various channels (such as APP, H5, mini-programs, suppliers) and protocols (such as HTTP, SOTP) is distributed to the gateway through load balancing and, after a series of business logic steps, is finally forwarded to the backend service. After the improvements in Chapter 2, the horizontal business has improved significantly in performance and stability.
picture
On the other hand, because multiple channels and protocols exist, online gateways are deployed as independent clusters per business. In the early days, business differences (such as routing data and functional modules) were managed through independent code branches, but as the number of branches grew, overall operation and maintenance became increasingly complex. In system design, complexity usually also means risk. How to manage multi-protocol, multi-role gateways uniformly, and how to build customized gateways for new businesses quickly and at lower cost, therefore became the focus of our next stage of work. The solution is shown in the figure: first, make the protocols compatible so that the online code can run under one framework; second, introduce a control plane to manage the differing characteristics of the online gateways uniformly.
picture
4.1 Multi-protocol compatibility
The multi-protocol compatibility approach is not new; one reference is Tomcat's abstraction over HTTP/1.0, HTTP/1.1, and HTTP/2.
Although each HTTP version adds many new features, we usually do not notice these changes when developing business code. The key lies in the abstraction provided by the HttpServletRequest interface. At Ctrip, the online gateways handle stateless protocols in request-response mode, and their messages can likewise be divided into three parts: metadata, extension header, and business message, so a similar approach is easy to apply. The related work can be summarized in the following two points:
picture
4.2 Routing Module
The routing module is one of the two main components of the control plane. In addition to managing the mapping between gateways and services, the service itself can be summarized by the following model:
4.3 Module Orchestration
Module orchestration is the other key component of the control plane. We have set up multiple stages in the gateway processing flow (indicated in pink in the figure). In addition to common functions such as circuit breaking, rate limiting, and logging, the business functions that different gateways need to execute are assigned by the control plane at runtime. These functions live in independent code modules inside the gateway, and the control plane additionally defines their execution conditions, parameters, grayscale ratios, and error handling. This approach also keeps the modules decoupled to a certain extent.
picture
5. Conclusion
Gateways have always been a hot topic on technical exchange platforms, and there are many mature solutions: the easy-to-use, early-established Zuul 1.0, the high-performance Nginx, the highly integrated Spring Cloud Gateway, the increasingly popular Istio, and so on. The final choice still depends on each company's business background and technical ecosystem, which is why Ctrip chose the path of independent research and development. Technology keeps evolving, and we continue to explore topics such as the relationship between public gateways and business gateways, the application of new protocols (such as HTTP/3), and the connection with ServiceMesh.