Usually, our applications do not need to handle thousands of users in parallel, nor process thousands of messages per second. Coping with dozens or hundreds of concurrently connected users is enough for internal applications and many microservices. In that case we can use high-level frameworks/libraries that are not optimized in terms of threading model or memory usage, and still get away with reasonable resource consumption and a reasonably fast delivery time.
However, sometimes we encounter a situation where one part of our system needs to scale better than the rest. Writing that part with traditional methods or frameworks can lead to huge resource consumption and to running many instances of the same service just to handle the load. The problem of handling many thousands of concurrent connections is known as the C10K problem, and there are well-known algorithms and techniques for solving it. In this article I will focus mainly on optimizations on the TCP connection/traffic side: how to make (micro)service instances waste as few resources as possible, and how to gain a deep understanding of how the OS works with TCP and sockets. Let's get started.

I/O Programming Strategies

Let's describe the I/O programming models we currently have and the options to choose from when designing an application. First of all, there is no good or bad approach, only the one that is more suitable for the current use case. Choosing the wrong approach can have very inconvenient consequences later: wasted resources, or even rewriting the application from scratch.

Blocking I/O with blocking processing

Thread-per-connection server

The idea behind this approach is that no socket connection is accepted unless there is a dedicated/idle thread available (we'll show what that means later). Blocking in this context means that a specific thread is bound to a connection and always blocks when reading from or writing to the connection.
The simplest version of a socket server listens on port 5050 and reads from an InputStream and writes to an OutputStream in a blocking manner. This is useful when we only need to transfer a small number of objects over a connection, then close it and open a new one when needed.
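The original listing is not reproduced here, so the following is a minimal sketch of such a server; the echo behavior, buffer size, and thread-per-connection dispatch are illustrative assumptions, not a production design:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of the thread-per-connection model described above:
// one thread blocks on accept(), and each connection gets its own thread
// that blocks on read()/write().
public class BlockingEchoServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5050)) {
            while (true) {
                Socket socket = server.accept();       // blocks until a client connects
                new Thread(() -> handle(socket)).start();
            }
        }
    }

    static void handle(Socket socket) {
        try (socket;
             InputStream in = socket.getInputStream();
             OutputStream out = socket.getOutputStream()) {
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {      // blocks until data arrives
                out.write(buffer, 0, n);               // echoes back; blocks until written
            }
        } catch (Exception e) {
            // a real server would log the failure
        }
    }
}
```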
Non-blocking I/O with blocking processing

Thread pool-based server

This is the category that most well-known enterprise HTTP servers fall into. Generally speaking, this model uses multiple thread pools to make processing more efficient in multi-CPU environments and more suitable for enterprise applications. There are several ways to configure the thread pools, but the basic idea is exactly the same in all these HTTP servers. See HTTP Grizzly I/O Strategies for the possible strategies that can be configured on a thread pool-based non-blocking server in general.
We need to clarify the non-blocking terminology here: non-blocking refers only to the socket handling (accepting connections, reading and writing are done by a small number of selector/I/O threads that never block on a connection), while the business processing may still contain blocking operations and is therefore executed on a separate, larger worker pool.
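As a rough sketch of this model (the port, buffer size, worker pool size, and echo logic are illustrative assumptions), a single selector thread can do the non-blocking socket work and hand requests off to a worker pool for the blocking business logic:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: one selector thread does the non-blocking I/O,
// a worker pool runs the (potentially blocking) business logic.
public class ThreadPoolNioServer {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(200); // sized for blocking work (assumed)

        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(5050));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                         // wait for readiness events
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    if (client == null) continue;
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    int n = client.read(buf);          // non-blocking read
                    if (n < 0) { client.close(); continue; }
                    buf.flip();
                    // hand the blocking part off so the selector thread never stalls
                    workers.submit(() -> handleRequest(client, buf));
                }
            }
        }
    }

    // stands in for blocking business logic (DB calls, remote services, ...)
    static void handleRequest(SocketChannel client, ByteBuffer request) {
        try {
            client.write(request);                     // echo back, just for the sketch
        } catch (IOException e) { /* close/log in a real server */ }
    }
}
```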
The blocking nature of the business logic is the main reason why the worker pool is so large: we simply need a large number of threads in flight to keep throughput up. Otherwise, under higher load (for example, more HTTP requests), we could end up with all threads in a blocked state and no threads available for request processing (no runnable threads left to execute on the CPU).

Advantages

Even if the number of requests is quite high and many of our worker threads are blocked on some blocking operation, we are still able to accept new connections, even though we may not be able to process their requests immediately and the data has to wait in the TCP receive buffer. This programming model is implicitly used by many frameworks/libraries (Spring Controllers, Jersey, ...) and HTTP servers (Jetty, Tomcat, Grizzly, ...) because it makes it very easy to write business code that lets the thread block when really needed.

Disadvantages

Parallelism is usually not determined by the number of CPUs, but is limited by the nature of the blocking operations and the number of worker threads. In general, this means that if the ratio of time spent in blocking operations (I/O) to the rest of the execution (in the middle of a request) is too high, then we can get:
Large thread pools, which lead to context switching and inefficient use of CPU caches.

How to size a thread pool

OK, so we have one or more thread pools handling the blocking business operations. But what is the optimal size of such a pool? We may run into two problems: a pool that is too small cannot keep up with the load and requests queue up waiting for a free thread, while a pool that is too large wastes memory on thread stacks and burns CPU time on context switching.
For guidance, I can refer you to Brian Goetz's book Java Concurrency in Practice, which says that sizing a thread pool is not an exact science; it is more about understanding your environment and the nature of the tasks. The book's well-known sizing heuristic is sketched below.
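Goetz's heuristic is threads = cores * target utilization * (1 + wait time / compute time). A rough sketch of applying it; the utilization and timing values below are illustrative assumptions that would normally come from profiling:

```java
public class PoolSizeEstimate {
    public static void main(String[] args) {
        // Goetz's heuristic from Java Concurrency in Practice:
        //   threads = cores * targetUtilization * (1 + waitTime / computeTime)
        int cores = Runtime.getRuntime().availableProcessors();
        double targetUtilization = 0.8; // how busy we want the CPUs to be (assumed)
        double waitTime = 50.0;         // avg ms a task spends blocked (assumed; measure it)
        double computeTime = 5.0;       // avg ms a task spends on the CPU (assumed; measure it)

        int poolSize = (int) (cores * targetUtilization * (1 + waitTime / computeTime));
        // e.g. with 8 cores: 8 * 0.8 * (1 + 50/5) = ~70 threads
        System.out.println("suggested pool size: " + poolSize);
    }
}
```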
If our program contains I/O or other blocking operations, we need a larger pool, because our threads cannot stay on the CPU all the time. Use a profiler or a benchmark to estimate the ratio of waiting time to computing time, and observe the CPU utilization under different production workloads (peak vs. off-peak times).

Non-blocking I/O with non-blocking processing

Server based on the same number of threads as CPU cores

This strategy is most effective if we can handle most of the workload in a non-blocking manner. This means that socket handling (accepting connections, reading, writing) is implemented with non-blocking algorithms, and the business processing does not contain any blocking operations either. The poster child for this strategy is the Netty framework, so let's take a deep dive into the architectural foundations of this framework to understand why it is so well suited to solving the C10K problem. If you want to learn more about how it works, I can recommend the following resource: Netty in Action, written by Norman Maurer, one of the authors of the Netty framework. It is a valuable resource for understanding how to implement Netty-based clients and servers using handlers with various protocols.

I/O library with an asynchronous programming model

Netty is an I/O library and framework that simplifies non-blocking I/O programming and provides an asynchronous programming model for the events that occur during the server lifecycle and on incoming connections. We just hook our lambdas onto the callbacks and get everything for free. Many protocols can be used without pulling in one large library. Starting to build applications with pure JDK NIO is very frustrating; Netty lets the programmer stay at a low level while making many things easier and more efficient. Netty already includes most of the well-known protocols, which means we can work with them more efficiently than with the heavy boilerplate of higher-level libraries (such as Jersey/Spring MVC for HTTP/REST).

Identify the right non-blocking use cases to fully exploit Netty's capabilities

The I/O handling, the protocol implementations, and all other handlers should use non-blocking operations so that the current thread is never stopped. We can always use an additional thread pool for blocking operations. However, if we need to hand the processing of every request over to a dedicated thread pool to perform a blocking operation, we are barely using Netty's power: we end up in almost the same situation as with non-blocking I/O with blocking processing, i.e. one big thread pool, just located in a different part of the application.

The main components of the Netty architecture are:

EventLoopGroup - collects event loops and registers channels with one of them.

EventLoop - handles all I/O operations for the channels registered with it. An EventLoop runs on exactly one thread, so the optimal number of event loops per EventLoopGroup is the number of CPUs (some frameworks use the number of CPUs + 1 to have a spare thread in case of page faults).

Pipeline - maintains the execution order of the handlers (the ordered components that are executed when a certain inbound or outbound event occurs and that contain the actual business logic). The pipeline and its handlers are executed on the thread belonging to the EventLoop, so a blocking operation in a handler blocks all other processing/channels on that EventLoop.
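To make these components concrete, here is a minimal Netty-based echo server; the port and the trivial echo handler are illustrative assumptions. It shows a boss and a worker EventLoopGroup, channels registered with an event loop, and a pipeline with one non-blocking handler:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// Sketch: boss group accepts connections, worker group runs each
// channel's pipeline on its single event-loop thread.
public class NettyEchoServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup(); // defaults to 2 * cores threads
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // handlers run on the channel's event loop,
                        // so they must never block
                        ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                            @Override
                            public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                ctx.writeAndFlush(msg); // echo the buffer back, non-blocking
                            }
                        });
                    }
                });
            ChannelFuture future = bootstrap.bind(5050).sync();
            future.channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```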