What factors should be considered when designing a high-performance and high-concurrency network system? (10,000-word article)


"There are few things in the world that can be called natural and self-evident, and the same is true for the complex Internet architecture. A tall building is built from the ground up, and the architecture is the result of evolution. So what is the essence of evolution?"

— 1 —

Introduction

Software complexity comes from several aspects: high concurrency, high performance, high availability, scalability, low cost, small scale, maintainability, security, etc. Architecture evolution and development are all attempts to reduce this complexity:

  • High concurrency and high performance: Internet systems have the characteristics of large numbers of users and requests, so high concurrency and high performance are essential requirements. Poor performance will lead to a poor experience, and users will have other choices.
  • High availability: High system availability can improve user experience and has become a must-have requirement. More than a decade ago, we needed to do T+N operations to buy stocks, but now we can do it in real time through mobile phones.
  • Scalable and easy to iterate: In the early stage of the product, a single or simple architecture is adopted. In the mature stage, it evolves into the current concept of large and medium platforms and small front-ends, separating the unchanged from the changed. Product managers and architects need to avoid infinitely magnifying requirements and designing for the future, which will put them in an awkward situation.
  • Low cost: Reducing cost is an ongoing process; the ROI of further cost optimization decreases over time.
  • Low scale: Small scale means low cost, and convenient operation, maintenance, and expansion. Therefore, the design principles of simple, applicable, and evolutionary architecture are very important.
  • Easy operation and maintenance: Beyond traditional operation and maintenance, rapid business development, grayscale releases, fast release and rollback, partial feature upgrades, A/B testing, etc. place higher requirements on the architecture, which is one of the reasons containerization technology is so popular now.

This article analyzes the key problems solved during the evolution of network application architectures, from the perspective of how to achieve high-concurrency, high-performance systems, and tries to extract some patterns. It can also guide us on which links to focus on when building such systems.

  • How can stand-alone resources be used more effectively? What practices has open source software adopted for high performance and high concurrency?
  • Under the premise of high concurrency, how can cross-machine remote calls improve concurrency and "performance"? How should distributed services be split, and how should they be split so as to achieve high performance and high availability without wasting resources?

Note: Too many call links will cause significant performance loss.

... ...

Due to limited space, this article will not go into all the details.

— 2 —

Start with a network connection

The browser/app generally uses http or https protocols to communicate with the backend, and the underlying layer uses TCP (Transmission Control Protocol), while RPC remote calls can directly use TCP connections. Let's start the article with TCP connections.

  • As everyone knows, TCP establishes a connection with a three-way handshake and tears it down with a four-way handshake. Briefly:
  • The client actively initiates the connection. After three alternating interactions (with intermediate states), both sides reach the ESTABLISHED state and full-duplex data transmission can begin.

Either party can actively initiate disconnection; after four sends and replies in total (with intermediate states), the connection is closed.

Note: For more details, please refer to the relevant documents. On both Windows and Linux servers, the netstat -an command can be used to view connection states.

In network programming, we generally focus on the following indicators regarding connections:

1. Connection related

How many client connections the server can maintain, manage, and handle.

  • Active connections: TCP connections in the ESTABLISHED state that are transmitting data at a given moment. With long connections, a single connection carries multiple requests over time. This metric also indirectly reflects the backend service's concurrent processing capability, which is not the same thing as concurrency (request volume).
  • Inactive connections: the number of TCP connections in all states other than ESTABLISHED.
  • Concurrent connections: the total number of established TCP connections = active connections + inactive connections.
  • Number of new connections: the average number of new connection requests from clients to the server during the statistical period. This mainly measures the ability to cope with sudden traffic, or the ramp from normal to peak traffic, for example in flash-sale and ticket-grabbing scenarios.
  • Discarded connections: the number of connections discarded per second. If the server applies connection circuit breaking, these are the connections rejected by the breaker.

Regarding the number of TCP connections: in Linux it is limited by the file descriptor limit, which can be viewed and modified with ulimit -n, and it is also constrained by hardware resources such as CPU, memory, and network bandwidth. A single machine can sustain hundreds of thousands of concurrent connections. How is this achieved? We will explain this later when discussing IO models.

2. Traffic related

Mainly the configuration of network bandwidth.

  • Inbound traffic: The traffic consumed by external access to the server.
  • Outgoing traffic: The traffic that the server responds to externally.

3. Number of packets

Data packets are the encapsulated units of content transmitted after the TCP three-way handshake establishes a connection.

  • Incoming packets: The number of request packets received by the server per second.
  • Outgoing packets: The number of packets sent by the server per second.

For details about TCP/IP packets, please refer to the relevant documents. One thing must be noted, however: a single request may be split into multiple packets for sending, and the network middleware handles packet splitting and sticky packets (message framing) for us, with solutions such as fixed-length padding, newline/delimiter termination, custom message headers, or a custom protocol; a minimal sketch of the length-prefix approach follows below. If the user data we transmit is small, efficiency will certainly improve; on the other hand, compressing transmission packets without limit costs CPU for decompression, so a balance must be struck.
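As an illustration of message framing (added here, not from the original text), the following is a minimal Java sketch of the custom-message-header idea: a 4-byte length prefix tells the receiver exactly how many payload bytes belong to one message, regardless of how TCP splits or merges packets. The class and method names are hypothetical.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // Hypothetical helper: length-prefixed framing over a TCP stream.
    public final class LengthPrefixedCodec {

        // Write one message: a fixed 4-byte header carrying the payload length, then the payload.
        public static void writeMessage(DataOutputStream out, byte[] payload) throws IOException {
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
        }

        // Read one complete message: first the header, then exactly that many payload bytes.
        public static byte[] readMessage(DataInputStream in) throws IOException {
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);   // blocks until the whole payload has arrived
            return payload;
        }
    }

Frameworks such as Netty ship ready-made decoders for this pattern, so hand-rolling it is rarely necessary.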

4. Application Transport Protocol

A transport protocol with a good compression ratio and good transmission performance brings a large improvement to concurrent performance. However, it also depends on whether the languages used by both calling parties support the protocol. You can define your own, or use a mature transport protocol, for example the Redis serialization protocol (RESP), JSON, Protocol Buffers, HTTP, etc. Especially for RPC calls, the transport protocol must be chosen carefully.

5. Long and short connections

  • A long connection means that many data packets can be sent consecutively over the same TCP connection. If no data is being sent on the connection, both parties need to send probe packets to keep it alive.
  • Handling of half-open connections: after a normal TCP connection is established between client and server, if the client host loses its network (cable unplugged), loses power, or crashes, the server will never find out on its own. Long-connection middleware needs to handle this detail; the Linux TCP keepalive default is 2 hours and can be reconfigured (see the keepalive sketch after this list).
  • A short connection means that a TCP connection is established when the two parties have data to exchange and is torn down once the data has been sent. Each establishment costs a three-way handshake and each teardown a four-way handshake.
  • It is best for the client to actively initiate closing the connection, so that the TIME_WAIT state falls on the client rather than the server, reducing server resource usage.
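The sketch below (not from the original text) shows the OS-level half of half-open detection: enabling TCP keepalive on a Java socket. The host and port are hypothetical; the probe timing comes from kernel settings (on Linux, net.ipv4.tcp_keepalive_time defaults to 7200 seconds), which is why long-connection middleware usually adds its own application-level heartbeats as well.

    import java.net.Socket;

    public class KeepAliveDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical host and port, for illustration only.
            try (Socket socket = new Socket("example.com", 9000)) {
                // Ask the OS to send TCP keepalive probes on this connection so that a
                // dead (half-open) peer is eventually detected and the fd is released.
                socket.setKeepAlive(true);
                // ... exchange data over the long connection ...
            }
        }
    }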

Selection suggestion:

In scenarios with a small number of clients, long connections are generally used. Long connections are best suited to communication between backend middleware and microservices, such as database connections, Dubbo's default protocol, etc.

Large web and app front-ends use short HTTP connections (HTTP/1.1 keep-alive reuses the connection, which is long-lived in disguise, but requests and responses are still serialized on it). HTTP/2 supports true long connections with multiplexing.

Persistent connections consume more resources on the server side. With millions of users, each holding a dedicated connection, the pressure and cost on the server are considerable. IM and push applications use persistent connections, but they do a great deal of optimization work.

Since HTTPS requires encryption and decryption, it is best paired with HTTP/2 (which in practice mandates TLS) for good transmission performance. The trade-off is that the server must maintain more long-lived connections.

6. Concurrent connections and concurrency

  • Concurrent connections = active connections + inactive connections: the total number of established TCP connections, i.e., the number of connections a network server can manage in parallel.
  • Number of active connections: all TCP connections in the ESTABLISHED state.
  • Concurrency: the number of business requests a network server can process in parallel at a given instant. It is usually easier to estimate on the processing side and has no fixed relationship with the number of active connections.
  • RT (response time): the RT of each operation on a single machine differs. For example, reading data from a cache and writing to the database in a distributed transaction consume different resources, and the operations themselves take different amounts of time.
  • Throughput: QPS/TPS, the number of queries or transactions that can be processed per second; this is a key indicator.

Comprehensive consideration is required from the overall system level, each individual service level, and each method within the service.

Here are some examples:

  • Opening the product details page requires separation of dynamic and static operations. The subsequent series of dynamic services and cache mechanisms will shorten the overall RT itself, and the QPS that a single machine can support is higher. (There are also differences between services and methods)
  • However, order submission operations require distributed transactions, distributed locks, etc., the RT itself will be long, and the QPS supported by a single machine will be low.
  • Will we deploy more machines for the order submission service? The answer is not necessarily. Because the frequency of users browsing products is very high, while the frequency of submitting orders is very low. How to evaluate it correctly?
  • Services need to be classified into critical and non-critical services, and the QPS each service requires during peak hours must be weighed to reach a balanced configuration.

The overall system throughput, RT response time, and supported concurrency are composed of small operations and microservices. Each microservice and operation also needs to be evaluated separately. After a balanced combination, the overall system indicators are formed.
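As a rough, back-of-the-envelope illustration (not from the original text), Little's Law ties these indicators together: concurrency ≈ throughput × RT. The sketch below uses made-up numbers for the two scenarios above to show why the same machine supports very different QPS for different operations.

    // Rough capacity estimate based on Little's Law: QPS ≈ worker threads / RT.
    // All numbers are illustrative assumptions, not measurements.
    public class CapacityEstimate {
        public static void main(String[] args) {
            int workerThreads = 200;               // business threads available on one machine
            double detailPageRtSeconds = 0.020;    // cached product-detail read: ~20 ms
            double orderSubmitRtSeconds = 0.200;   // order submit with distributed transaction: ~200 ms

            // Max QPS if every request fully occupies one thread for its RT.
            System.out.printf("product detail ~ %.0f QPS%n", workerThreads / detailPageRtSeconds);  // ~10000
            System.out.printf("order submit   ~ %.0f QPS%n", workerThreads / orderSubmitRtSeconds); // ~1000
        }
    }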

7. Request processing flow and balance

First, let's look at the typical process by which an Internet server handles a network request:

Note: Regarding the conversion of user-mode and kernel-mode data, in some special scenarios, middleware such as Kafka can use zero copy technology to avoid the overhead of switching between the two states.

a. Steps (1, 2, 3): the client makes a network request, a connection is established (and managed), the request is sent, and the server receives the request data.

b. (4) Build a response. Process the client's request in user space and build a response.

c. (5, 6, 7) The server sends the response to the client through the fd connection in a.

The above can be divided into two key points:

  • a and c: how the server manages network connections, obtains input data from clients, and responds to clients.
  • b: how the server processes the requests.

Network applications need to balance a+c against b: the number of connection requests that can be managed must be balanced against the ability to process the work arriving on those connections.

For example, an application has hundreds of thousands of concurrent connections, and these connections make about 20,000 requests per second. It needs to manage 100,000 connections and handle 20,000 requests per second to achieve balance. How to achieve high QPS? There are two directions:

  • Single machine optimization (see the middleware example below)
  • Forward to multiple machines for processing (remote call)

Note: Generally, the system management connection capacity is much greater than the processing capacity.

As shown in the figure above, the client's requests will form a large queue; the server will process the tasks in this large queue. How big this queue can be depends on the connection management capability; how to ensure the balance between the rate of tasks entering the queue and the speed of processing and removing tasks is the key. Achieving balance is the goal.

— 3 —

Common IO models in network programming

The interaction between client and server produces a connection, which is represented on the Linux server side by a file descriptor (fd), by a socket in socket programming, and by a Channel in the Java API. The IO model can be understood as the mechanism for managing fds: reading data from the client through an fd (the client request) and writing data to the client through an fd (the response).

Regarding synchronous, asynchronous, blocking, and non-blocking IO operations, the descriptions found online and in books differ and are often imprecise. We follow Chapter 6 of "UNIX Network Programming, Volume 1" (I/O Multiplexing) as the standard. The book describes 5 I/O models available under UNIX: blocking I/O, non-blocking I/O, I/O multiplexing (select, poll, epoll), signal-driven I/O, and asynchronous I/O. (For details, please refer to the relevant books and materials.)

1. Blocking I/O: The process will be stuck in the call of recvfrom, waiting for the final result data to be returned. It is definitely synchronous.

2. Non-blocking I/O: The process repeatedly calls recvfrom in a round-robin fashion until the final result data is returned. This is also a synchronous call, but the IO kernel does not block the call. This is not very practical, so we will not discuss its application.

3. I/O multiplexing is also synchronous: the process blocks on the select/epoll call rather than on recvfrom, until the final result is returned.

Note: select model: the fds to be managed are placed in an array and the array is looped over. The array size is limited to 1024, so the number of manageable connections is limited. poll is similar to select but changes the data structure, removing the 1024 limit.

epoll (event poll) keeps the interest list in the kernel and only reports fds that have events, that is, it only processes active connections, which is why it scales to far more connections. See http://libevent.org/ for a library that wraps these mechanisms. (A minimal Java NIO example, which is epoll-backed on Linux, appears after this list.)

4. Signal-driven I/O: also synchronous. It is rarely used in practice (mainly available on UNIX), so it is not discussed here.

5. Asynchronous I/O: only this model is truly asynchronous. Among mainstream operating systems, only Windows implements it fully in the kernel (IOCP), so it is not discussed in depth here. Node.js middleware exposes the idea through callbacks, and Java AIO also provides it, but it is difficult to develop with.
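To make the multiplexing model concrete (referenced in the epoll note above), here is a minimal, illustrative Java NIO echo server: one thread and one Selector manage many connections, blocking on select() instead of on each connection's read(). On Linux the JDK Selector is implemented on top of epoll. The port is a hypothetical choice.

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class MultiplexEchoServer {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(9000));          // hypothetical port
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();                              // block here, not on recvfrom/read
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {                   // connection-establishment event
                        SocketChannel client = server.accept();
                        if (client == null) continue;
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {              // data ready on an existing fd
                        SocketChannel client = (SocketChannel) key.channel();
                        ByteBuffer buf = ByteBuffer.allocate(1024);
                        int n = client.read(buf);
                        if (n < 0) { client.close(); continue; }
                        buf.flip();
                        client.write(buf);                      // echo the bytes back
                    }
                }
            }
        }
    }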

The difference between synchronous/asynchronous, blocking/non-blocking in the IO model (easy to understand):

  • Synchronous vs. asynchronous concerns how the data is obtained: a synchronous caller actively reads and writes the data and requires the called IO to return the final result; an asynchronous caller, after issuing the request, only waits for notification that the IO operation has completed and does not actively read or write the data, which is done by the system kernel;
  • The difference between blocking and non-blocking lies in whether the data to be accessed by the process or thread is ready and whether the process or thread needs to wait; waiting is blocking, and no waiting is non-blocking.

In our usual programming and function interface calling process, except for timeout, a result will be returned. Synchronous and asynchronous calls are divided into the following:

  • If the returned result is the final result, it is a synchronous call, such as executing a data query SQL.
  • If the result returned is only an intermediate notification, it is asynchronous: for example, when sending a message to an MQ, only an ack is returned. From the sender's point of view the send is synchronous; viewed at the system architecture level it is asynchronous, because the actual processing is done by the message consumer. (If the message is sent successfully but the network drops before the ack arrives, that is a failure case and outside this discussion.)
  • A callback function can also be passed as a parameter: the callback must be executed by the language runtime or middleware engine, e.g. the JVM supports it, and the Node.js V8 engine supports it (the callback executes in the same context as the caller and shares stack variables, etc.). A small sync-vs-async sketch follows this list.
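The following small example (an illustration added here, not from the original text) contrasts the two styles in Java: the synchronous call blocks until the final result, while the asynchronous call returns immediately and a callback runs when the result is ready. The queryDatabase method is a hypothetical stand-in for a blocking SQL query.

    import java.util.concurrent.CompletableFuture;

    public class SyncVsAsync {
        // Hypothetical stand-in for a blocking database query.
        static String queryDatabase() {
            return "row";
        }

        public static void main(String[] args) {
            // Synchronous: the caller waits for the final result.
            String result = queryDatabase();
            System.out.println("sync result: " + result);

            // Asynchronous: the caller gets control back at once; the runtime invokes
            // the callback when the result becomes available.
            CompletableFuture<Void> future =
                    CompletableFuture.supplyAsync(SyncVsAsync::queryDatabase)
                                     .thenAccept(r -> System.out.println("async result: " + r));
            future.join();   // only so the demo does not exit before the callback runs
        }
    }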

Note: Don't be confused by the select keyword! There are several technical implementations of IO multiplexing: select, poll, epoll; please refer to the materials for details. Almost all middleware uses the epoll mode. In addition, because multiplexing mechanisms differ across operating systems, epoll, kqueue, and IOCP each have their own characteristics; third-party libraries such as Libevent encapsulate these differences and provide a unified API. The Java language and Netty provide even higher-level encapsulation. Java NIO and Netty retain the select naming, which also causes some confusion.

Section: Current network middleware uses the blocking IO and IO multiplexing models to manage connections and obtain data through network IO. The next sections walk through middleware cases that use these IO models.

— 4 —

Concrete implementations of the synchronous blocking IO model: PPC, TPC

From the perspective of pure network programming technology, there are two main ideas for server data processing:

  • One is to assign a separate process/thread to each connection and let it handle that connection until processing completes: the PPC/TPC mode;
  • The other is to have the same process/thread handle several connections at once, and to process the data on those connections with multi-threading and multi-process techniques: the Reactor mode;

Each process/thread handles one connection, which is called PPC or TPC. PPC stands for Process Per Connection and TPC stands for Thread Per Connection. The network server implemented by the traditional blocking IO model adopts this mode.

Note: close here refers specifically to the parent process decrementing its reference count on the connection; the connection is actually closed in the child process. In a multi-threaded implementation the main thread does not need a close operation, because parent and child threads share memory (e.g., the JMM in Java).

Note: In the pre mode, threads and processes are created in advance and connections are assigned to the pre-created threads or processes. With multiple processes this raises the thundering herd problem.

Creating threads or processes consumes significant system resources. The operating system has limited CPU and memory, so the number of threads it can manage at once is limited, and there cannot be too many threads dedicated to handling connections. Although processes or threads can be created in advance (prefork/prethread) or thread pools can be used to reduce creation overhead, the pool size is still a ceiling. In addition, parent-child process communication is comparatively complicated.

Apache's MPM prefork (PPC) supports 256 concurrent connections by default, and Tomcat with synchronous IO (TPC) works in blocking IO mode and can support around 500 concurrent connections. Java can use a thread pool to reduce the cost of creating a thread per request.
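For reference, the following is a minimal, illustrative thread-per-connection (TPC) echo server in Java: the accepting loop blocks on accept(), and every connection gets a dedicated thread that blocks on read(), so the thread's lifetime is tied to the fd's lifetime, which is exactly the limitation discussed here. The port is hypothetical.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class TpcEchoServer {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(9000)) {   // hypothetical port
                while (true) {
                    Socket conn = server.accept();                  // blocks until a client connects
                    new Thread(() -> handle(conn)).start();         // one thread per connection
                }
            }
        }

        static void handle(Socket conn) {
            try (Socket c = conn;
                 InputStream in = c.getInputStream();
                 OutputStream out = c.getOutputStream()) {
                byte[] buf = new byte[1024];
                int n;
                while ((n = in.read(buf)) != -1) {                  // thread blocks here per connection
                    out.write(buf, 0, n);                           // echo the data back
                }
            } catch (Exception ignored) {
                // connection closed or failed; the thread simply ends with it
            }
        }
    }

Replacing new Thread(...) with a fixed thread pool bounds resource usage, but it then also caps the number of simultaneously served connections at the pool size.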

A single machine's fds can number in the tens of thousands, but each thread occupies system memory, and the total number of threads that can exist simultaneously is limited. In Linux you can check the per-thread stack allocation with ulimit -s. More threads also increase the CPU's scheduling overhead. How do we resolve this imbalance?

Section: The bottleneck of PPC and TPC is the small number of connections they can manage. Multi-threaded business-processing capability was sufficient, but each thread is bound to an fd and shares its life cycle, which limits the thread's processing capability. The split: decouple the fd's life cycle from the thread's life cycle.

— 5 —

Concrete implementation of the IO multiplexing model: Reactor

Each process/thread handles multiple connections at the same time (IO multiplexing), and multiple connections share a blocking object. The application only needs to wait on a blocking object, without blocking and waiting for all connections. When a connection has new data to process, the operating system notifies the application, the thread returns from the blocking state (there are better optimizations, see the next section), and starts business processing; this is the idea of ​​the Reactor mode.

The Reactor pattern is an event-driven processing pattern in which service requests are delivered to a service handler through one or more inputs simultaneously. The server program handles multiple client requests and synchronously dispatches them to the threads responsible for each request; the Reactor pattern is therefore also called the Dispatcher pattern. That is, I/O multiplexing listens for events in a unified way, and after events are received they are dispatched to a process or thread. It is one of the essential techniques for writing high-performance network servers, and many excellent pieces of network middleware are built on this idea.

Note: Since epoll manages a much larger number of connections than select, the underlying implementations in frameworks such as libevent and netty are all epoll-based, but the programming API retains the select keyword. Therefore, epoll_wait is equivalent to select in this article.

The Reactor pattern has several key components:

  • Reactor: Reactor runs in a separate thread and is responsible for listening to fd events and distributing them to appropriate handlers to respond to IO events. Connection establishment events are distributed to Acceptor; read/write processing events are distributed to Handler.
  • Acceptor: responsible for handling connection establishment events and creating corresponding Handler objects.
  • Handlers: responsible for handling read and write events: read the request data from the fd, process it to produce a response, and send the response. The Handler performs the actual work associated with the IO event.

The Reactor pattern suits IO-intensive scenarios, but ThreadLocal cannot be used because handlers for different connections share threads. Development and debugging are difficult, so it is generally not recommended to implement it yourself; use an existing framework.

Section: The Reactor pattern can raise the number of manageable network connections to hundreds of thousands. However, the request tasks on all those connections still need to be handled through multi-threading and multi-process mechanisms, or even forwarded to other servers for processing.

— 6 —

Reactor pattern practice case (C language)

Through the examples of several open source frameworks, we can understand the network frameworks in different scenarios, how to use the Reactor mode, and what detailed adjustments have been made.

Note: The actual implementation will certainly be very different from the figure. The client io and send are relatively simple and are omitted in the figure.

A. Single Reactor + single thread processing (one thread overall) represented by redis

As shown in the figure:

  1. Client request-> Reactor object accepts the request and listens to the request event through select(epoll_wait)-> distributes the event through dispatch;
  2. If it is a connection request event->dispatch->Acceptor (accept establishes a connection)->create a Handler object for this connection and wait for subsequent business processing.
  3. If it is not a connection establishment event -> dispatch distribution event -> triggering the Handler object (read, business processing, send) created for this connection, a task/command queue is formed.
  4. The Handler object completes the overall process of read->business processing->send.

Convert the request into a command queue and process it in a single process. Note that the queue in the figure is processed by a single thread and there is no competition.

Advantages:

The model is simple. It is the easiest to implement in code and suits scenarios where each request can be processed quickly.

No need to consider concurrency issues. The model itself is single-threaded, which makes the main logic of the service also single-threaded, so there is no need to consider many concurrency issues, such as locks and synchronization.

Suitable for short, fast operations. For Redis, where handling an event is essentially a memory lookup, it fits very well: the achievable concurrency is acceptable, and many Redis data structure operations are very cheap.

Disadvantages:

Performance issues: With only one thread, the performance of a multi-core CPU cannot be fully utilized.

Sequential execution delays subsequent events. Because everything is processed in order, a single long-running event delays all the tasks behind it, which is especially unbearable for IO-intensive applications.

This is why Redis discourages time-consuming commands (for example, KEYS over a large keyspace).

Note: Redis implements its own IO multiplexing rather than using libevent; the actual implementation does not match the diagram exactly and is more lightweight.

This model can handle read and write event operations in a very short time, and can achieve a throughput of about 100,000 QPS (the various redis commands vary greatly).

Note: The Redis release ships with the redis-benchmark performance testing tool, which can be used to measure QPS. Example: use 50 concurrent connections to issue 100,000 requests, each carrying 2 KB of data, against the Redis server at host 127.0.0.1 and port 6379: ./redis-benchmark -h 127.0.0.1 -p 6379 -c 50 -n 100000 -d 2

For network systems facing a large number of clients, the emphasis is on the number of concurrent connections from many clients. For systems with a small number of backend peers, long connections are used: the number of concurrent connections is small, but each connection carries a large number of requests.

B. Single Reactor + Single Queue + Business Thread Pool

As shown in the figure, the real business processing is separated from the Reactor thread and carried out by a business thread pool. How does each fd's Handler object in the Reactor communicate with the worker thread pool? Through a pending-request queue. The client requests arriving at the server can be imagined as one large request queue of IO; after the Reactor (multiplexing) processes it, it is split into a queue of pending work tasks.

Note: There are splits everywhere!

The business thread pool is an independent pool of threads that take data from the queue, perform the real business processing, and return the result to the Handler. When the Handler receives the response result, it sends the result to the client.
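A minimal sketch of this hand-off (illustrative, not any framework's actual code): the Reactor/IO thread only reads a complete request and submits it to a business thread pool; a worker processes it and writes the response back. All class and method names are hypothetical, and a real implementation would route the write back through the Reactor (e.g., via OP_WRITE) rather than writing directly from the worker.

    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class WorkerPoolHandler {
        // Business thread pool: decouples slow processing from the IO/Reactor thread.
        private final ExecutorService businessPool = Executors.newFixedThreadPool(16);

        // Called on the IO (Reactor) thread once a complete request has been read from the fd.
        public void onRequest(SocketChannel channel, byte[] request) {
            businessPool.submit(() -> {
                byte[] response = process(request);   // time-consuming business logic off the IO thread
                write(channel, response);             // hand the result back to the client
            });
        }

        private byte[] process(byte[] request) {
            return request;                           // placeholder for real business processing
        }

        private void write(SocketChannel channel, byte[] response) {
            try {
                channel.write(ByteBuffer.wrap(response));  // simplified direct write
            } catch (Exception e) {
                // connection is broken; a real server would clean up the Handler here
            }
        }
    }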

Compared with model A, thread pool technology speeds up request processing. For example, the nonblocking server in Thrift 0.10.0 uses this model and can reach tens of thousands of QPS.

Disadvantages: the weak point of this model is the single queue, which becomes a performance bottleneck: the thread pool must synchronize when taking tasks from the queue, so a high-performance (lock-based or lock-free) queue implementation is needed.

C. Single Reactor + N Queues + N Threads

This model is a variant of A and B, and memcached adopts it. The pending queue is split into multiple queues, each bound to one processing thread. This keeps IO multiplexing managing as many connections as possible while removing the bottleneck of a single queue; QPS is estimated to reach around 200,000.

However, this solution has a notable drawback: load can become unbalanced, with some queues busy while others sit idle. Fortunately memcached operations are all in-memory and not very sensitive to load skew, so this model works for it.
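A minimal illustration of the N-queues + N-threads idea (names are hypothetical, not memcached's actual code): each worker is a single-threaded executor with its own queue, and a connection is pinned to one worker by hashing its id, so requests from one connection never contend on a shared queue.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PartitionedWorkers {
        private final ExecutorService[] workers;

        public PartitionedWorkers(int n) {
            workers = new ExecutorService[n];
            for (int i = 0; i < n; i++) {
                // One thread plus its own task queue per worker: no shared-queue locking.
                workers[i] = Executors.newSingleThreadExecutor();
            }
        }

        // All requests from the same connection land on the same worker (and queue).
        public void dispatch(int connectionId, Runnable requestTask) {
            int slot = Math.floorMod(connectionId, workers.length);
            workers[slot].execute(requestTask);
        }
    }

The drawback mentioned in the text is visible here: the assignment is static, so a worker whose connections happen to be busy cannot hand work to an idle one.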

D. Single-process Reactor monitoring + N-process (accept+epoll_wait+processing) model

Process:

The master (main Reactor) process listens for new connections and lets one of the worker processes accept them. The thundering herd problem has to be handled here; see nginx's accept_mutex design for details.

After a worker (sub-Reactor) process accepts an fd, it registers the fd with its own epoll handle and handles all subsequent read and write events on that fd.

The worker process selectively does not accept new fds based on its own load conditions, thereby achieving load balancing.

Advantages:

A hung worker process does not take the whole service down.

Workers balance load themselves by deciding whether to accept, which is simpler than having the master assign connections.

Disadvantages:

Multi-process model programming is more complicated, and inter-process synchronization is not as simple as threads.

Processes have more overhead than threads

Nginx uses this model. Since nginx mainly provides reverse proxying and static web content services, its QPS is tied to the backend servers it proxies for.

Note: The nodejs multi-process deployment method is similar to the nginx method.

Section: From these Reactor examples, we hope to see which problems each split solves and which new problems it introduces.

— 7 —

Reactor pattern practice case (Java language Netty)

Netty is an asynchronous event-driven network application framework for the rapid development of maintainable, high-performance protocol servers and clients. Many open source Java network middleware projects use Netty. This article only covers the parts relevant to NIO multiplexing; for message framing (unpacking/sticky packets), scheduled heartbeat tasks, serialization hooks, and so on, please see the references. As shown in the figure:

Netty can be configured to determine in which thread (pool) each module runs:

1. Single Reactor Single Thread

    EventLoopGroup bossGroup = new NioEventLoopGroup(1); // a single Reactor thread
    EventLoopGroup workerGroup = bossGroup;              // the listening thread and worker threads are the same
    ServerBootstrap server = new ServerBootstrap();
    server.group(bossGroup, workerGroup);

2. Single Reactor Multi-threaded SubReactor

    EventLoopGroup bossGroup = new NioEventLoopGroup(1);
    EventLoopGroup workerGroup = new NioEventLoopGroup(); // defaults to CPU cores * 2
    ServerBootstrap server = new ServerBootstrap();
    server.group(bossGroup, workerGroup);                 // separate the boss thread from the worker threads

3. Single Reactor, multi-threaded subReactor, and designated thread pool to handle business

https://netty.io/4.1/api/io/netty/channel/ChannelPipeline.html

We define multiple ChannelHandlers in a pipeline to receive I/O events (e.g., read) and to request I/O operations (e.g., write and close). For example, a typical server has the following handlers in each channel pipeline (depending on the complexity and characteristics of the protocol and business logic used):

  • Protocol Decoder − Converts binary data (e.g. ByteBuf) into Java objects.
  • Protocol Encoder − Converts Java objects to binary data.
  • Business Logic Handler − Performs the actual business logic (such as database access).

As shown in the following example:

    static final EventExecutorGroup group = new DefaultEventExecutorGroup(16);
    ...
    ChannelPipeline pipeline = ch.pipeline();
    pipeline.addLast("decoder", new MyProtocolDecoder());
    pipeline.addLast("encoder", new MyProtocolEncoder());
    // Tell the pipeline to run MyBusinessLogicHandler's event handler methods
    // outside the I/O thread, so that a time-consuming task runs in the custom
    // thread group (pool) and the I/O thread is not blocked.
    // If your business logic is fully asynchronous or finishes very quickly,
    // you do not need to specify an extra thread group.
    pipeline.addLast(group, "handler", new MyBusinessLogicHandler());

As mentioned earlier, web applications accept millions or tens of millions of network connections and organize them into requests and responses, like one large queue. Handling the tasks in that queue well involves a series of load issues: load balancing, locking, blocking, thread pools, multi-process, forwarding, synchronous and asynchronous processing. Both single machines and distributed deployments need to be optimized, and Netty has made a lot of optimizations; this part of the Netty source code is not easy to understand:

Business processing and IO task common thread pool

Custom thread pool processing business

As shown in the figure: in Netty there is a variable number of Channels, a fixed number of NioEventLoops, and an EventExecutor backed by an external thread pool. Driven by the irregular events of many channels, coordinating these threads is complicated.

Here is a question to ponder: why can Spring WebFlux (built on Netty) and Node.js support a very large number of connections, yet the CPU becomes the bottleneck?

Section: In this way, the whole chain from the client initiating a request -> establishing a connection to the server -> the server listening and transferring data without blocking -> business processing -> responding has no bottleneck or weak link: through IO multiplexing, IO thread pools, and business thread pools, the system achieves overall high performance and high throughput.

However, business-processing capacity is far lower than connection-management capacity, and a single machine will eventually hit its ceiling. Further splitting (letting professional middleware do professional work) and RPC/microservice calls are the way forward.

— 8 —

Distributed remote calls (not the end but the beginning)

As seen above, the ultimate bottleneck of a single machine is business processing. For Java, the number of threads cannot grow without bound; even with Go's lower-overhead goroutines, the CPU of a single machine eventually becomes the bottleneck. Therefore, distributed remote calls across machines are clearly the direction for solving the problem, and the industry already has plenty of practice. Let's look at three typical architecture diagrams to see which problems each evolution solves and how:

Note: This article does not discuss SOA, RPC, microservices, etc., but only focuses on the basis and goals of the split.

A. Monolithic Application

B. Separate network connection management and static content

C. Business functional split

A: A typical monolithic application.

A -> B: Connection management and business processing are separated. nginx, with its powerful connection-management capability, is placed in front, and business processing is spread across multiple machines.

B -> C: Business processing is split along functional lines. Some services focus on protocol parsing, some on business decisions, and some on database operations, and the splitting continues.

Through Figure C, from the perspective of high performance, we can see the principles and points to note for service layering (there are many technical selections for each layer):

1. Reverse proxy layer (handles the HTTPS connections)

  • It can be implemented with an nginx cluster, or with LVS or F5.
  • Since this layer is implemented by the upper-level nginx, we know it handles a large number of HTTP or HTTPS requests.
  • The core indicators are: number of concurrent connections, number of active connections, inbound and outbound traffic, inbound and outbound packet counts, throughput, etc.
  • Internally, the protocol-parsing, compression, and packet-handling modules can be optimized. The key direction is the throughput of proxied requests, i.e., nginx's capacity for forwarding to the backend application servers, which determines the overall throughput.
  • Static files should all go through a CDN.
  • Since HTTPS handshakes and crypto are time-consuming, it is advisable to use HTTP/2 or keep connections alive longer. This also depends on the business, e.g. whether each app interacts with the backend frequently; after all, keeping too many connections open is costly and hurts multiplexing efficiency.

2. Gateway layer (general non-business operations)

The reverse proxy layer connects to the gateway layer over HTTP, and the two communicate via intranet IPs, which is much more efficient. We assume the gateway layer uses TCP long connections to its downstream services, which in Java can be provided by RPC frameworks such as Dubbo.

The gateway layer mainly does several things:

  • Authentication
  • Packet integrity checking
  • Converting the HTTP JSON transport format into Java objects
  • Route translation (converting requests into microservice calls)
  • Service governance functions (rate limiting, degradation, circuit breaking, etc.); a minimal rate-limiting sketch appears below
  • Load balancing

The gateway layer can be implemented with open source Zuul, Spring Cloud Gateway, Node.js, etc. nginx can also be customized for gateway development and physically merged with the reverse proxy layer.
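As an illustration of the rate-limiting/degradation responsibility listed above (not from the original text, and far simpler than what Zuul or Spring Cloud Gateway provide), the sketch below caps the number of requests a gateway instance handles concurrently and fails fast when the limit is reached. All names are hypothetical.

    import java.util.concurrent.Semaphore;
    import java.util.function.Supplier;

    public class ConcurrencyLimiter {
        private final Semaphore permits;

        public ConcurrencyLimiter(int maxConcurrent) {
            this.permits = new Semaphore(maxConcurrent);   // e.g. tuned to downstream capacity
        }

        public String handle(Supplier<String> downstreamCall) {
            if (!permits.tryAcquire()) {
                return "429 Too Many Requests";            // fast-fail / degradation path
            }
            try {
                return downstreamCall.get();               // forward to the microservice
            } finally {
                permits.release();
            }
        }
    }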

3. Business logic layer (business-level operations)

From this layer onward, we can consider vertical layering by business domain, for example a user logic layer, an order logic layer, etc. If split this way, we may abstract out a common business logic layer. We try to ensure that services in the business logic layer do not call each other horizontally, and are only called from upstream to downstream.

  • Business logic judgment
  • Business logic processing (combination)
  • Distributed transaction implementation
  • Distributed lock implementation
  • Business Cache

4. Data access layer (database storage related operations)

  • Focuses on data create, read, update, and delete operations.
  • ORM encapsulation
  • Hides the details of database and table sharding.
  • Cache design
  • Shields differences between storage tiers
  • Idempotent data-storage implementation

Note: This section borrows some views from Mr. Sun Xuan's "Million-a-Year-Salary Architect" course, which I recommend; it covers architectural practice, microservice implementation, service governance, and more, from fundamentals to hands-on practice.

Below the gateway layer and above the database, the RPC middleware selection criteria and technical indicators are as follows (from the Dubbo official website):

  • The core indicators are: concurrency, TPS/QPS, and RT (response time).
  • Protocol choices: dubbo, rmi, hessian, webservice, thrift, memcached, redis, rest
  • Number of connections: long connections are usually single; short connections require multiple
  • Connection style: long or short connections
  • Transport protocol: TCP, HTTP
  • Transmission mode: synchronous, NIO non-blocking
  • Serialization: binary (Hessian)
  • Applicable payloads: large files, very large strings, short strings, etc.
  • Choose according to the application scenario; the dubbo protocol is generally the default.

Section:

In the stand-alone era, each thread managed one network connection; then, with IO multiplexing, a single thread managed many connections, freeing resources for business processing. Later, the IO thread pool and the business thread pool were separated. A pattern emerges: the client connection request is the starting point, and the backend's processing capability is progressively balanced and strengthened; business-processing capacity always lags behind connection-accepting capacity.

Reverse proxy era: nginx can manage enough connections, and the backend can forward to N application servers (Tomcat). Resources are used more efficiently: through the choice of hardware and software, connection management and request processing are physically separated, and software and hardware each do what they are best at.

SOA and microservice era: (SOA actually emerged for the sake of low coupling and has little to do with high performance and high concurrency.) Business processing comes in many kinds: some is computation-intensive; some operates on databases; some only reads a little data from a cache; some services are used very heavily, others rarely. To use resources better, there are two splitting directions: separate out the services that operate on databases (the data access layer) and split out the business logic processing (the business logic layer). Following this logic, a configuration such as 1 nginx + 3 Tomcat gateways + 5 Dubbo business-logic instances + 10 Dubbo data-access instances might be appropriate. The goal of such sizing is that the dedicated services at each layer keep their servers at around 60% resource usage.

Note: This article only discusses horizontal layering at the functional level. Vertical layering is also needed; for example, user management and order management are two different kinds of business with different technical characteristics and access frequencies, and the storage layer likewise needs vertical database and table splitting. These are omitted here.

In the stand-alone stage, multi-threading and multi-processing are effectively a kind of vertical concurrency split; staying stateless and avoiding locks as much as possible there is consistent with the stateless principle and distributed locks of microservices.

— 9 —

Summary

Looking back over the article: what should happen after the client connects to the server? Is the performance bottleneck in maintaining so many connections, or in the processing of each connection failing to keep up, creating an imbalance? How do we break the deadlock? From the internals of a single machine to splitting across physical machines, three points stand out, in order of importance:

Focus on balance: only when the architecture is balanced can it be a high-performance, high-concurrency architecture; any performance problem arises from some specific point. Balance even extends to weighing business needs against complexity.

The way of splitting: solve the right problem with the right technology and the right middleware. Concretely, how to split horizontally and vertically also requires analyzing the scenario.

Understand business scenarios and the nature of the problem, and understand the solutions to common scenarios: following the path of discovering, analyzing, and solving problems, once the arsenal has been stocked, solving a problem becomes a matching exercise.

In addition to the technologies and splitting solutions mentioned in the article, many technical points can improve throughput and performance, as shown below:

  • IO multiplexing: manage more connections
  • Thread pool technology: tap the potential of multi-core CPUs
  • Zero-copy: reduce data copies and switches between user space and kernel space, e.g. transferTo in Java and the sendfile system call in Linux (a minimal sketch appears after this list);
  • Sequential disk writes: reduce seek overhead; used by message queues and database logs.
  • Better-compressed protocols: reduce the cost of network transmission, e.g. custom or binary transport protocols;
  • Partitioning: storage systems use partitions throughout; in microservices, designing services to be stateless can also be understood as a form of partitioning.
  • Batch transfer: the classic database batching technique; many network middleware also use it, for example message queues.
  • Indexing techniques: not database indexes as such, but indexes designed to fit the business scenario to improve efficiency; for example, Kafka uses some clever index tricks for its file storage.
  • Cache design: when data is modified infrequently, changes follow a regular pattern, or the data is too expensive to generate, caching can be considered.
  • Trading space for time: partitioning, indexing, and caching can in fact all be classified here; for example, storing data in an inverted index, or serving with multiple copies of data across multiple nodes.
  • Choice of network connection: long vs. short connections, reliable vs. unreliable protocols, etc.
  • Message framing (unpacking/sticky packets): batching and protocol selection are both related to this.
  • High-performance distributed locks: locks are unavoidable in concurrent programming. Prefer high-performance distributed locks and CAS-style optimistic locks, and avoid pessimistic locks; if the business allows, acquire locks asynchronously rather than blocking synchronously, to reduce lock contention.
  • Flexible transactions instead of rigid transactions: some exceptions or failures cannot be recovered from simply by retrying.
  • Eventual consistency: if the business scenario allows, aim for eventual consistency of the data.
  • Asynchronize non-core business: move certain tasks into another queue (a message queue); the consumer side can then process them in batches and with multiple consumers.
  • Direct IO: applications that build their own cache mechanisms, such as databases, can use direct IO and bypass the cache provided by the operating system.
  • Note: any of these means, divorced from the business scenario, remains talk on paper; but without knowing the means, you will be at a loss when the scenario arrives. The overall idea is to divide and conquer the huge queue formed by client requests and break it down piece by piece.
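As referenced in the zero-copy bullet above, here is a minimal, illustrative sketch of FileChannel.transferTo in Java: the kernel moves file bytes to the socket (sendfile on Linux) without copying them through a user-space buffer. The file name, host, and port are hypothetical.

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySend {
        public static void main(String[] args) throws Exception {
            try (FileChannel file = new FileInputStream("data.bin").getChannel();          // hypothetical file
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("example.com", 9000))) {
                long position = 0;
                long remaining = file.size();
                while (remaining > 0) {
                    // The kernel transfers bytes directly from the page cache to the socket.
                    long sent = file.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }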
