"There are few things in the world that can be called natural and self-evident, and the same is true for the complex Internet architecture. A tall building is built from the ground up, and the architecture is the result of evolution. So what is the essence of evolution?" — 1 — Introduction Software complexity comes from several aspects: high concurrency, high performance, high availability, scalability, low cost, low scale, maintainability, security, etc. Architecture evolution and development are all attempts to reduce complexity:
This article analyzes the key problems solved during the evolution of network application architecture from the perspective of building high-concurrency, high-performance systems, and tries to extract some patterns. These patterns can also tell us which links in the chain deserve attention when we build such systems ourselves.
Note: Too many links in the call chain cause significant performance loss. ... Due to limited space, this article will not go into every detail.
— 2 — Start with a network connection
A browser or app generally talks to the backend over HTTP or HTTPS, which run on top of TCP (Transmission Control Protocol), while RPC calls can use TCP connections directly. So let's start with the TCP connection.
Either party can actively initiate disconnection; tearing a connection down takes four exchanges in total (with intermediate states in between), after which the connection is closed. Note: for more details, refer to the TCP documentation. On both Windows and Linux servers you can inspect current connections with netstat -an. In network programming we generally care about the following connection-related indicators: 1. Connection related How many client connections the server can maintain, manage, and handle.
In Linux, the number of TCP connections is limited by the number of file descriptors, which can be viewed and modified with ulimit -n; it is also bounded by hardware resources such as CPU, memory, and network bandwidth. A single machine can sustain hundreds of thousands of concurrent connections. How is this achieved? We will explain later, when discussing the IO model. 2. Traffic related Mainly the provisioning of network bandwidth.
3. Number of packets A data packet is the unit of encapsulated content transmitted after the TCP three-way handshake has established a connection.
For details about TCP/IP packets, refer to the relevant documentation. One thing must be noted, though: a single request may be split across multiple packets, and the network middleware handles packet splitting and sticking for us (through solutions such as fixed-length messages, newline/delimiter termination, custom message headers carrying a length field, or a full custom protocol); a minimal framing sketch appears after item 4 below. Keeping the transmitted user data small certainly improves efficiency; on the other hand, if transmitted payloads are compressed without limit, decompression consumes CPU, so a balance has to be struck. 4. Application transport protocol A transport protocol with a good compression ratio and good transmission performance gives a large boost to concurrent performance, but it also depends on whether the languages on both sides of the call can use it. You can define your own or use a mature one, for example the Redis serialization protocol, JSON, Protocol Buffers, or HTTP. Transport protocol selection deserves particular care for RPC calls.
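To make the "custom message header" idea concrete, here is a minimal, hedged sketch of length-prefix framing in Java; the 4-byte big-endian length header, the class name, and the method names are illustrative assumptions rather than any specific middleware's protocol.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative length-prefix framing: each frame is [4-byte length][payload].
public final class LengthPrefixFraming {

    // Encode one payload into a frame: 4-byte big-endian length, then the payload bytes.
    public static byte[] encode(byte[] payload) {
        ByteBuffer frame = ByteBuffer.allocate(4 + payload.length);
        frame.putInt(payload.length);
        frame.put(payload);
        return frame.array();
    }

    // Decode as many complete frames as the buffer holds; leave any partial frame
    // in the buffer so the next read can complete it. This is exactly how "split"
    // and "sticky" packets are reassembled into whole messages.
    public static List<byte[]> decode(ByteBuffer buffer) {
        List<byte[]> messages = new ArrayList<>();
        buffer.flip();
        while (buffer.remaining() >= 4) {
            buffer.mark();
            int length = buffer.getInt();
            if (buffer.remaining() < length) {   // only part of the payload has arrived
                buffer.reset();                  // roll back and wait for more data
                break;
            }
            byte[] payload = new byte[length];
            buffer.get(payload);
            messages.add(payload);
        }
        buffer.compact();                        // keep leftover bytes for the next read
        return messages;
    }
}

In practice frameworks already provide this: Netty, for example, ships a LengthFieldBasedFrameDecoder, so you rarely write framing by hand.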
5. Long and short connections Selection suggestion: long connections are generally used when the number of clients is small, and are best suited to communication between backend middleware and microservices, such as database connections or the default Dubbo protocol. Large web and app front ends use short HTTP connections (HTTP/1.1 keep-alive is a long connection in disguise, but the interaction is still serial request/response); HTTP/2 supports true long connections. Persistent connections consume more resources on the server side: with millions of users, giving every user a dedicated connection puts heavy pressure on the server and raises costs. IM and push applications do use persistent connections, but with a great deal of optimization work. Since HTTPS requires encryption and decryption, HTTP/2 (which in practice mandates TLS) is preferable for its transmission performance, though the server then has to maintain more connections. 6. Concurrent connections and concurrency
Concurrency has to be considered at the level of the overall system, of each individual service, and of each method within a service. Here are some examples:
The overall system throughput, RT (response time), and supported concurrency are composed from many small operations and microservices; each microservice and each operation also needs to be evaluated on its own, and the overall system indicators emerge from a balanced combination of them. (As a rough rule of thumb, concurrency ≈ QPS × average RT, so these indicators constrain one another.) 7. The request-processing flow and balance First, let's look at the typical flow of an Internet server processing a network request: Note: regarding copying data between user mode and kernel mode, in some special scenarios middleware such as Kafka uses zero-copy techniques to avoid the overhead of switching between the two. a. Steps (1, 2, 3): the client makes a network request, a connection is established (and managed), the request is sent, and the server receives the request data. b. Step (4): the client's request is processed in user space and a response is built. c. Steps (5, 6, 7): the server sends the response back to the client through the fd of the connection from a. The above can be divided into two key points:
A network application should balance a+c against b, that is, balance the number of connection requests it can manage against its ability to process them. For example, if an application holds hundreds of thousands of concurrent connections and those connections issue about 20,000 requests per second, it must both manage the 100,000-odd connections and process 20,000 requests per second to stay in balance. How do we achieve high QPS? There are two directions:
Note: In general, a system's capacity to manage connections is much greater than its capacity to process them. As shown in the figure above, the clients' requests form one large queue, and the server works through the tasks in that queue. How large the queue can be depends on the connection-management capability; the key is keeping the rate at which tasks enter the queue in balance with the rate at which they are processed and removed. Achieving that balance is the goal.
— 3 — Common IO models in network programming
Each client-server interaction produces a connection, which appears as a file descriptor (fd) on a Linux server, as a socket in socket programming, and as a channel in the Java API. An IO model can be understood as a mechanism for managing fds: reading data from the client through an fd (the request) and writing data back through it (the response). Descriptions of synchronous, asynchronous, blocking, and non-blocking IO vary across the Internet and across books, so we take chapter 6 of "UNIX Network Programming, Volume 1" (I/O Multiplexing) as the standard. It describes five I/O models available under UNIX: blocking I/O, non-blocking I/O, I/O multiplexing (select, poll, epoll), signal-driven I/O, and asynchronous I/O. (For details, refer to the book and related material.)
1. Blocking I/O: the process is stuck in the recvfrom call, waiting for the final result to be returned. It is clearly synchronous.
2. Non-blocking I/O: the process calls recvfrom repeatedly in a polling loop until the final result is returned. This is also synchronous, although the kernel does not block the call. It is not very practical, so we will not discuss its application.
3. I/O multiplexing, which is also synchronous: the process blocks on select/epoll rather than on recvfrom; once an fd is ready, recvfrom returns the result. Note: the select model puts the fds to be managed into an array and loops over it; the array size is 1024, so the number of manageable connections is limited. poll is similar to select but replaces the fixed array with a linked list, so there is no 1024 limit. epoll (event poll) only deals with fds on which events have occurred, that is, it only touches active connections, and it avoids repeatedly copying the whole fd set between user space and the kernel. See http://libevent.org/
4. Signal-driven I/O: also synchronous. It is implemented only on UNIX and rarely used, so it is not discussed here.
5. Asynchronous I/O: only this model is truly asynchronous. At the operating-system level it is only implemented well on Windows (IOCP), so it is not discussed here either. Node.js exposes this style through callbacks, and Java AIO also wraps it; it is comparatively difficult to develop with.
The difference between synchronous/asynchronous and blocking/non-blocking in these IO models (an intuitive view):
In everyday programming, a function call returns a result unless it times out. Calls can be divided into synchronous and asynchronous as follows:
Note: Don't confuse the select keyword! There are several implementations of IO multiplexing: select, poll, and epoll (see the references for details), and almost all middleware uses epoll. In addition, because operating systems implement multiplexing differently, epoll, kqueue, and IOCP each have their own characteristics; third-party libraries such as libevent encapsulate these differences behind a unified API, and Java NIO and Netty provide even higher-level encapsulation. Java NIO and Netty keep the name "select" in their APIs, which adds to the confusion.
Section: Today's network middleware uses the blocking IO and IO multiplexing models to manage connections and move data over the network. A minimal Java NIO selector sketch follows, and the next sections walk through middleware cases that apply these IO models.
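As a concrete illustration of IO multiplexing (one thread watching many connections), here is a minimal, hedged sketch using the standard Java NIO Selector API; error handling, write-interest management, and real business logic are omitted, and the port number is an arbitrary assumption.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);                    // non-blocking fd
        server.register(selector, SelectionKey.OP_ACCEPT);  // watch for new connections

        while (true) {
            selector.select();                              // block here, not on each read()
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {                   // new connection: register it too
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {              // data is ready: read without blocking
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    int n = client.read(buf);
                    if (n < 0) { client.close(); continue; }
                    buf.flip();
                    client.write(buf);                      // echo back, standing in for real handling
                }
            }
        }
    }
}

On Linux this Selector is backed by epoll, so the single thread above can watch a very large number of fds; that is precisely the "manage many connections" capability discussed in this section.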
— 4 — A concrete implementation of the synchronous blocking IO model: PPC and TPC
From the perspective of pure network programming technology, there are two main approaches to server-side data processing:
Each process/thread handles one connection, called PPC (Process Per Connection) or TPC (Thread Per Connection). Network servers built on the traditional blocking IO model adopt this mode. Note: the close in the diagram refers to the parent process releasing its reference to the connection fd; the connection is actually closed in the child process. In a multi-threaded implementation the main thread needs no such close, because parent and child threads share memory (compare the Java memory model, JMM). Note: in the "pre" variants, threads or processes are created in advance and connections are then assigned to them; with multiple processes this introduces the thundering-herd problem. Creating threads or processes consumes a lot of system resources: the operating system has limited CPU and memory, the number of threads it can schedule at once is limited, and so there cannot be too many threads handling connections. Even if processes or threads are created in advance (prefork/prethread) or a thread pool is used to reduce creation pressure, the pool size is still a ceiling, and parent-child process communication is comparatively complicated. Apache's MPM prefork (PPC) supports about 256 concurrent connections, and Tomcat's synchronous IO connector (TPC), working in blocking mode, supports about 500. Java can use a thread pool to cut the overhead of creating threads, and the connection fds themselves could support tens of thousands of connections, but each thread occupies memory (in Linux, ulimit -s shows the default stack allocation) and more threads mean more CPU scheduling overhead, so the total number of threads that can exist at the same time is limited. How do we resolve this imbalance?
Section: The bottleneck of PPC and TPC is the small number of connections that can be managed. The multi-threaded business-processing capability would be sufficient, but each thread is bound to one fd and its life cycle is tied to that fd, which caps the threads' processing capability. The split: separate the fd life cycle from the thread life cycle. A minimal thread-per-connection sketch follows, to make the limitation concrete.
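For reference, here is a minimal, hedged sketch of the TPC (thread-per-connection) model in plain Java blocking IO; the port, the buffer size, and the echo behaviour are arbitrary assumptions, and a real server would at least use a bounded thread pool.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnectionSketch {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(8080);
        while (true) {
            Socket socket = server.accept();               // blocks until a client connects
            new Thread(() -> handle(socket)).start();      // one thread per connection: the TPC ceiling
        }
    }

    private static void handle(Socket socket) {
        try (Socket s = socket;
             InputStream in = s.getInputStream();
             OutputStream out = s.getOutputStream()) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {             // the thread sits blocked here for the life of the fd
                out.write(buf, 0, n);                      // echo back, standing in for business logic
            }
        } catch (Exception ignored) {
            // connection dropped; the thread simply ends
        }
    }
}

The thread spends most of its life blocked in read(), so its life cycle is welded to the fd's; that coupling is exactly what the Reactor model below removes.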
— 5 — A concrete implementation of the IO multiplexing model: Reactor
Each process/thread handles multiple connections at the same time (IO multiplexing): the connections share one blocking object, so the application waits on that single object instead of blocking on every connection. When a connection has new data to process, the operating system notifies the application, the thread returns from the blocked state (there are further optimizations, see the next section) and starts business processing. This is the idea of the Reactor pattern: an event-driven pattern in which service requests arriving over one or more inputs are demultiplexed and dispatched synchronously to the handlers associated with them. The Reactor pattern is also called the Dispatcher pattern: I/O multiplexing listens for events in one place and, on receiving an event, dispatches it to some process or thread. It is one of the essential techniques for writing high-performance network servers, and many excellent network middlewares are built on this idea.
Note: Since epoll can manage far more connections than select, the underlying implementations in frameworks such as libevent and Netty are epoll-based, even though the programming API keeps the select name; in this article, therefore, epoll_wait is treated as equivalent to select. The Reactor pattern has several key components:
Reactor suits IO-bound scenarios, but ThreadLocal can no longer be relied on (request processing is not pinned to one dedicated thread), and development and debugging are difficult, so implementing it yourself is generally not recommended; use an existing framework.
Section: Reactor raises the number of manageable network connections to the hundreds of thousands. The request tasks arriving on all those connections, however, still have to be handled by multi-threading or multi-process mechanisms, or even forwarded to other servers.
— 6 — Reactor pattern in practice (C-language middleware)
Through a few open-source examples we can see how network frameworks in different scenarios use the Reactor pattern and what detailed adjustments they make. Note: the real implementations certainly differ from the figures; client-side io and send are relatively simple and are omitted from the figures.
A. Single Reactor + single-thread processing (one thread overall), represented by Redis As shown in the figure:
Requests are converted into a command queue and processed in a single process; note that the queue in the figure is consumed by a single thread, so there is no contention.
Advantages: The model is simple: it is the easiest to implement, and suits applications whose work per event is small and in memory. No concurrency issues: the model itself is single-threaded, so the main service logic is also single-threaded and needs no locks or synchronization. Suited to short operations: for Redis, where almost every event is a memory lookup, it fits very well; the achievable concurrency is acceptable, and many Redis data structures can stay very simple because of it.
Disadvantages: Performance: with only one thread, a multi-core CPU cannot be fully utilized. Sequential execution delays later events: because everything is processed in order, one long-running event delays all the tasks behind it, which is unbearable for IO-intensive workloads; this is why Redis discourages time-consuming commands.
Note: Redis implements its own IO multiplexing rather than using libevent, so the real implementation is lighter-weight and does not match the diagram exactly. Because each read/write event is handled in a very short time, this model can reach a throughput of roughly 100,000 QPS (different Redis commands vary widely). Note: the Redis release ships with the redis-benchmark tool for measuring QPS. Example: use 50 concurrent connections to issue 100,000 requests of 2 KB each against the Redis server at host 127.0.0.1, port 6379: ./redis-benchmark -h 127.0.0.1 -p 6379 -c 50 -n 100000 -d 2
For network systems with a large number of clients, the emphasis is on the many clients, i.e. the number of concurrent connections; for backend systems with a small number of long connections, the concurrent connections are few but each connection issues a large number of requests.
B. Single Reactor + single queue + business thread pool
As shown in the figure, we separate the real business processing from the Reactor thread and run it in a business thread pool. How does the Handler of each fd in the Reactor communicate with the worker pool? Through a pending-request queue. The clients' requests to the server can be pictured as one big IO request queue; after the Reactor (multiplexing) has processed them, they are (split) into a queue of pending work tasks. Note: there are splits everywhere! The business thread pool takes tasks from the queue, does the real business processing, and returns the result to the Handler, which then sends the response to the client. Compared with model A, thread-pool technology speeds up the handling of client requests; for example, the nonblocking server in Thrift 0.10.0 uses this model and reaches tens of thousands of QPS. Disadvantages: the weakness of this model is the queue itself, which becomes the performance bottleneck; taking tasks from the queue requires locking, so a high-performance read-write lock is used to implement the queue. A minimal sketch of this hand-off follows.
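Here is a minimal, hedged sketch of model B in Java, assuming a selector loop like the one shown earlier feeds it: the IO thread only enqueues a task, and a fixed business thread pool drains the queue. The class names, queue capacity, and pool size are illustrative assumptions.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReactorWithWorkerPool {

    // A task carries the decoded request plus the channel on which to write the response.
    record Task(SocketChannel channel, byte[] request) {}

    // The pending-request queue shared by the Reactor thread (producer) and the workers (consumers).
    private final BlockingQueue<Task> pending = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(16);

    // Called from the selector (IO) thread: do no business work here, just enqueue.
    public void onRequest(SocketChannel channel, byte[] request) throws InterruptedException {
        pending.put(new Task(channel, request));
    }

    // Each worker drains the queue, runs the business logic, and writes the response back.
    public void start() {
        for (int i = 0; i < 16; i++) {
            workers.submit(() -> {
                while (true) {
                    try {
                        Task task = pending.take();                 // blocks while the queue is empty
                        byte[] response = process(task.request());  // the real business logic goes here
                        // Simplification: real frameworks usually hand the write back to the IO thread.
                        task.channel().write(ByteBuffer.wrap(response));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    } catch (IOException e) {
                        // the client probably went away; drop this task and keep serving others
                    }
                }
            });
        }
    }

    private static byte[] process(byte[] request) { return request; } // placeholder: echo
}

The single shared queue is exactly the bottleneck the text points out: every worker contends on it, which is what model C (multiple queues) below is designed to relieve.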
C. Single Reactor + N queues + N threads
This model, a variant of A and B, is what memcached adopts: the waiting queue is split into several queues, each bound to one processing thread. This lets IO multiplexing manage as many connections as possible while removing the single-queue bottleneck; QPS is estimated to reach around 200,000. The big disadvantage is load imbalance: some queues may be busy while others sit idle. Fortunately memcached operations are in-memory and not very sensitive to load skew, so the model works for it.
D. Single-process Reactor for listening + N processes (accept + epoll_wait + processing)
Process flow: the master process (the main Reactor) listens for new connections and lets one of the worker processes accept them; the thundering-herd problem has to be handled here, see nginx's accept_mutex design for details. After a worker (sub-Reactor) process accepts an fd, it registers the fd with its own epoll handle and handles all subsequent read and write events on that fd. A worker can decline to accept new fds based on its own load, thereby achieving load balancing.
Advantages: a hung worker process does not take down the whole service, and the workers balance the load themselves, which is simpler than having the master do it.
Disadvantages: multi-process programming is more complicated, inter-process synchronization is not as simple as between threads, and processes cost more than threads.
Nginx uses this model. Since nginx mainly provides reverse proxying and static web content, its QPS depends largely on the upstream servers it proxies. Note: the Node.js multi-process deployment approach is similar to nginx's.
Section: From these Reactor examples we want to see which problems each split solves and which new problems it introduces.
— 7 — Reactor pattern in practice (Java, Netty)
Netty is an asynchronous, event-driven network application framework for the rapid development of maintainable, high-performance protocol servers and clients. Many open-source Java network middlewares are built on Netty. This article covers only the parts relevant to NIO multiplexing; for packet splitting and sticking, scheduled heartbeat tasks, serialization hooks, and so on, see the references. As shown in the figure, Netty can be configured to decide in which thread (pool) each stage runs: 1. Single Reactor, single thread
2. Single Reactor, multi-threaded subReactor
3. Single Reactor, multi-threaded subReactor, and a designated thread pool for business logic (see https://netty.io/4.1/api/io/netty/channel/ChannelPipeline.html). We define multiple ChannelHandlers in a pipeline to receive I/O events (e.g. read) and to request I/O operations (e.g. write and close). A typical server, depending on the complexity and characteristics of its protocol and business logic, has handlers such as a protocol decoder, a protocol encoder, and a business logic handler in each channel's pipeline:
As shown in the following example:
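The sketch below follows the Netty ChannelPipeline documentation linked above; MyProtocolDecoder, MyProtocolEncoder, and MyBusinessLogicHandler are placeholders for your own ChannelHandler implementations, and the group size of 16 is an arbitrary assumption.

import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.socket.SocketChannel;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class ServerInitializer extends ChannelInitializer<SocketChannel> {

    // A separate executor group so that slow business logic never blocks the IO (event-loop) threads.
    private static final EventExecutorGroup businessGroup = new DefaultEventExecutorGroup(16);

    @Override
    protected void initChannel(SocketChannel ch) {
        ChannelPipeline pipeline = ch.pipeline();
        pipeline.addLast("decoder", new MyProtocolDecoder());   // bytes -> message object (placeholder handler)
        pipeline.addLast("encoder", new MyProtocolEncoder());   // message object -> bytes (placeholder handler)
        // Run the business handler on businessGroup instead of the IO thread, so a
        // time-consuming task does not stall the other channels on the same event loop.
        pipeline.addLast(businessGroup, "handler", new MyBusinessLogicHandler());
    }
}

This is configuration 3 above in miniature: the boss/worker event loops remain the Reactor, while the EventExecutorGroup plays the role of the designated business thread pool.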
As mentioned earlier, a web application accepts millions or tens of millions of network connections and organizes them into requests and responses, just like one large queue. Handling the tasks in that queue well involves a whole series of load issues: load balancing, locking, blocking, thread pools, multiple processes, forwarding, synchrony and asynchrony. Both single machines and distributed deployments need optimization, and Netty has done a great deal of it. This part of the Netty source code is not easy to follow; the two arrangements are business processing sharing the IO thread pool, and business processing on a custom thread pool. As shown in the figure: in Netty there is a variable number of channels, a fixed set of NioEventLoops, and EventExecutors backed by an external thread pool; coordinating threads driven by the irregular events of many channels is genuinely complicated. A question to ponder: why can Spring WebFlux (built on Netty) and Node.js support huge numbers of connections, yet still find the CPU becoming the bottleneck?
Section: We have now walked the whole path: the client initiates a request -> a connection is established with the server -> the server listens and transfers data non-blockingly -> business processing -> response. Through IO multiplexing, thread pools, and a separate business thread pool, the processing chain is left with no obvious bottleneck or weak point, achieving overall high performance and high throughput. Even so, the capacity for time-consuming processing remains far below the capacity for managing IO connections, and a single machine will eventually hit its ceiling; further splitting (professional middleware doing professional work) and RPC/microservice calls are the way forward.
— 8 — Distributed remote calls (not the end but the beginning)
As the previous sections show, the ultimate bottleneck of a single machine is business processing. In Java the number of threads cannot grow without bound, and even with Go's lighter-weight goroutines the CPU of a single machine eventually becomes the bottleneck, so distributed remote calls across machines are clearly the direction to take, and the industry already has plenty of practice. Let's look at three typical architecture diagrams and see what problems each step of the evolution solves and how: Note: this article does not discuss SOA, RPC, or microservices as such, only the basis and goals of the splits.
A. Monolithic application B. Network connection management and static content separated out C. Business split by function
A: a typical monolithic application. A->B: connection management and business processing are separated; nginx, with its strong connection-management capability, fronts the system while business processing is spread across multiple machines. B->C: business processing is split along functional lines; some services focus on protocol parsing, some on business decisions, some on database operations, and the splitting continues. From figure C, looking purely at high performance, we can read off the principles and points to note for service layering (each layer has many possible technology choices): 1. Reverse proxy layer (associated with the https connection)
2. Gateway layer (general, non-business operations) The reverse proxy layer connects to the gateway layer over HTTP, and since the two communicate over intranet IPs this is much more efficient. We assume the gateway layer uses TCP long connections to its downstream services, which Java RPC frameworks such as Dubbo can provide. The gateway layer mainly does several things:
The gateway layer can be implemented with open-source Zuul, Spring Cloud Gateway, Node.js, and so on; nginx can also be extended for gateway duties and physically merged with the reverse proxy layer. 3. Business logic layer (business-level operations) From this layer down we can consider splitting vertically by business, for example into a user logic layer, an order logic layer, and so on. If we split this way, we may also abstract out a common business logic layer. We try to ensure that the business logic layer never makes horizontal calls; calls only go from upstream to downstream.
4. Data access layer (database storage related operations)
Note: this section quotes some views from Mr. Sun Xuan's "Million-a-Year Architect" course, which I recommend; it covers architectural practice, microservice implementation, service governance, and more, from fundamentals to hands-on work. Between the gateway layer and the database, the RPC middleware choices and their technical indicators are as follows (from the Dubbo official website):
Section: In the single-machine era, each thread managed one network connection; then, with IO multiplexing, a single thread managed the network connections, freeing resources to handle business; then the IO thread pool and the business thread pool were separated. A pattern emerges: the client's connection request is the starting point, and the backend processing capability is gradually balanced and strengthened behind it, with business-processing capacity always lagging behind connection-accepting capacity.
Reverse proxy era: nginx can manage enough connections and forward to N Tomcat application servers behind it. To some extent this uses resources more efficiently: through hardware and software selection, the connection-management function and the connection-processing function are physically separated, and each piece of software and hardware does what it is best at.
SOA and microservice era: (SOA actually emerged for the sake of low coupling and has little to do with high performance and high concurrency.) There are many kinds of business processing: some are computation-intensive, some operate on databases, some only read a little data from a cache; some services are used heavily, others rarely. To make better use of resources there are two splitting mechanisms: services that operate on the database are separated out (the data access layer), and business logic processing is split out (the business logic layer). Following this logic, a configuration of, say, 1 nginx + 3 Tomcat gateways + 5 Dubbo business-logic services + 10 Dubbo data-access services may be appropriate; the goal of such sizing is that the dedicated services at each layer can keep their servers at around 60% resource usage.
Note: the article only covers horizontal layering by function; vertical layering is also needed. For example, user management and order management are two different kinds of business with different technical characteristics and access frequencies, and the storage level likewise needs vertical database and table sharding; this is omitted here for now. In the single-machine stage, multi-threading and multi-processing are in effect a kind of vertical concurrent split: we try to stay stateless and avoid locks as far as possible, which is consistent with the stateless-service and distributed-lock principles of microservices.
— 9 — Summary
Looking back over the article: what should happen after the client connects to the server? Is the performance bottleneck maintaining so many connections, or is the processing of each connection failing to keep up, creating an imbalance? How do we break the deadlock? From the inside of a single machine to splitting across physical machines, three points stand out, in order of importance:
Focus on balance: only an architecture in balance is a high-performance, high-concurrency architecture; any performance problem arises from some single point, and balance even extends to weighing business needs against complexity.
The way of splitting: solve the problem with the right technology and the right middleware for the job.
Be specific: how to split horizontally and vertically still requires analysis of the actual scenario.
Understand the business scenario and the nature of the problem && understand the common solutions for such scenarios: following the path of discovering the problem, analyzing it, and solving it, we have stocked the whole arsenal, and solving a problem becomes a matter of matching. Beyond the technologies and splitting approaches discussed in this article, many other technical points can improve throughput and performance, as shown below: