Java Server Model - TCP Connection/Flow Optimization

Usually, our applications do not need to handle thousands of users in parallel or process thousands of messages per second. We only need to cope with dozens or hundreds of concurrently connected users, a load that internal applications or typical microservices can bear comfortably.

In this case we can use high-level frameworks/libraries that are not optimized in terms of threading model or memory usage and still get reasonable resource consumption and reasonably fast delivery times.


However, sometimes we encounter a situation where one part of our system needs to scale better than the rest. Writing that part with traditional methods or frameworks can lead to huge resource consumption and the need to run many instances of the same service to handle the load. The challenge of handling thousands of concurrent connections, and the algorithms and techniques for doing so, is known as the C10K problem.

In this article I will mainly focus on optimizations that can be done on the TCP connection/traffic side to make (micro)service instances waste as few resources as possible, and on building a deep understanding of how the OS works with TCP and sockets. Let's get started.

I/O Programming Strategies

Let's describe the types of I/O programming models currently available and the options to choose from when designing an application. First of all, there is no good or bad approach, only one that is more suitable for our current use case. Choosing the wrong approach can have very inconvenient consequences in the future: wasted resources, or even rewriting the application from scratch.

Blocking I/O with blocking processing

Thread-per-connection server

The idea behind this approach is that if there are no dedicated/idle threads available, no new socket connections are accepted (we'll see what that means later). Blocking in this context means that a specific thread is bound to a connection and always blocks when reading from or writing to it.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public static void main(String[] args) throws IOException {
        try (ServerSocket serverSocket = new ServerSocket(5050)) {
            while (true) {
                // accept() blocks until a client connects
                Socket clientSocket = serverSocket.accept();
                var dis = new DataInputStream(clientSocket.getInputStream());
                var dos = new DataOutputStream(clientSocket.getOutputStream());
                // ClientHandler is a Runnable that serves one client over these
                // streams (implementation not shown)
                new Thread(new ClientHandler(dis, dos)).start();
            }
        }
    }

The simplest version of a socket server: it listens on port 5050, reads from an InputStream and writes to an OutputStream in a blocking manner. Useful when we need to transfer a small number of objects over a connection, then close it and open a new one when needed.

  • It is achievable even without any high-level libraries.
  • Use blocking streams for reading/writing: wait on a blocking InputStream read operation, which fills the provided byte array with the bytes available in the TCP receive buffer at that moment and returns the number of bytes read (or -1 at end of stream), and consume bytes until we have enough data to construct the request (see the sketch after this list).
  • A big problem and inefficiency arises when we create a thread for every incoming connection without bound. We pay the price of very expensive thread creation and of the memory footprint that goes hand in hand with mapping one Java thread to one kernel thread.
  • It is not suitable for "real" production unless we really need an application with a low memory footprint and don't want to load the many classes belonging to some framework.
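
To make the framing step concrete, here is a minimal sketch of blocking reads, assuming a hypothetical 4-byte length-prefixed message format (readMessage and the format itself are illustrative, not part of the server above):

    import java.io.DataInputStream;
    import java.io.IOException;

    // Assumes each message is preceded by a 4-byte length field (an illustrative format).
    static byte[] readMessage(DataInputStream dis) throws IOException {
        int length = dis.readInt();   // blocks until all 4 length bytes have arrived
        byte[] payload = new byte[length];
        dis.readFully(payload);       // blocks until the whole payload has arrived
        return payload;
    }

Both readInt and readFully block across as many TCP segments as necessary, which is exactly the behavior described above: the thread sleeps until the receive buffer delivers enough bytes.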

Non-blocking I/O with blocking processing

Thread pool based server

This is the category that most of the well-known enterprise HTTP servers fall into. Generally speaking, this model uses multiple thread pools to make processing more efficient in multi-CPU environments and more suitable for enterprise applications. There are several ways to configure the thread pools, but the basic idea is exactly the same in all HTTP servers. See Grizzly's HTTP I/O Strategies for the strategies that can generally be configured on a thread-pool-based non-blocking server.

  • A first thread pool for accepting new connections. It can even be a single thread if it can keep up with the rate of incoming connections. There are usually two backlogs here (the TCP SYN queue and the accept queue); when they fill up, the next incoming connection is rejected. If that happens, check whether persistent connections are being used correctly.

  • A second thread pool for reading from/writing to sockets in a non-blocking manner (selector threads or IO threads). Each selector thread handles multiple clients (channels).
  • A third thread pool to separate the non-blocking and blocking parts of request processing (usually called worker threads). A blocking operation must not run on a selector thread, because all other channels served by that thread could then make no progress (the channel group has only one thread, and it would be blocked).
  • Non-blocking read/write is implemented using buffers: whenever the thread handling a request does not yet have enough data to construct, e.g., an HTTP request, the selector thread reads the newly arrived bytes from the socket and appends them to a dedicated (pooled) buffer. See the JDK NIO sketch below.

We need to clarify non-blocking terminology:

  • In the context of a socket server, non-blocking means that the thread is not bound to an open connection and does not wait for incoming data (or for free space, if the TCP send buffer is full). It simply attempts a read; if no bytes are available, nothing is added to the buffer for further processing (constructing the request), and the selector thread continues reading from another open connection.
  • When it comes to processing the request, however, the code is blocking in most cases: we execute code that blocks the current thread while it waits for an I/O-bound operation (a database query, an HTTP call, reading from disk, etc.) or a long CPU-bound computation (calculating hashes/factorials, crypto mining, ...). When the operation completes, the thread is woken up and execution continues in the business logic.
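
To make the selector-thread mechanics concrete, here is a minimal sketch of a single selector loop using plain JDK NIO; the port, the buffer size, and the omitted per-channel accumulation are illustrative assumptions:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;

    public class SelectorLoop {
        public static void main(String[] args) throws IOException {
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(5050));
            server.configureBlocking(false);

            Selector selector = Selector.open();
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buffer = ByteBuffer.allocate(4096);
            while (true) {
                selector.select(); // waits until at least one channel is ready
                var keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buffer.clear();
                        int read = client.read(buffer); // returns only what is already buffered, never waits
                        if (read == -1) {
                            client.close(); // peer closed the connection
                        } else if (read > 0) {
                            buffer.flip();
                            // accumulate bytes per channel; once a full request can be
                            // constructed, hand it off to the worker pool (omitted)
                        }
                    }
                }
            }
        }
    }

One such loop serves many channels on one thread; enterprise servers run a few of these selector threads plus the worker pool described above.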

The blocking nature of the business logic is the main reason why the worker pool is so large: we simply need a large number of threads in play to sustain throughput. Otherwise, under higher load (for example, more HTTP requests), we could end up with all threads in a blocked state and none available for request processing (no threads in a runnable state that can be executed on the CPU).
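
The handoff from selector threads to the worker pool looks roughly like the sketch below; Request, Response, handleBusinessLogic, and enqueueResponse are hypothetical placeholders for the application's own types, and the pool size is an illustrative assumption:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Deliberately large pool, because most tasks spend their time blocked (size is an assumption).
    ExecutorService workers = Executors.newFixedThreadPool(200);

    void dispatch(Request request) {
        workers.submit(() -> {
            Response response = handleBusinessLogic(request); // may block on a DB query or HTTP call
            enqueueResponse(response); // hands the bytes back to a selector thread for writing
        });
    }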

Advantages

Even when the number of requests is quite high and many worker threads are blocked on blocking operations, we can still accept new connections, even though we may not be able to process their requests immediately and the data has to wait in the TCP receive buffer.

This programming model is implicitly used by many frameworks/libraries (Spring Controllers, Jersey, ...) and HTTP servers (Jetty, Tomcat, Grizzly ...) because it is very easy to write business code that lets the thread block if really needed.

Disadvantages

Parallelism is usually not determined by the number of CPUs; it is limited by the nature of the blocking operations and the number of worker threads. In general, if the ratio of time spent in blocking operations (I/O) to the remaining execution of a request is too high, we can get:

  • Many threads blocked on blocking operations (database queries, ...)
  • A large number of requests waiting to be processed by worker threads, and
  • A badly underutilized CPU, because no threads can continue executing

Large thread pools lead to context switching and inefficient use of CPU cache.

How to size a thread pool

OK, we have one or more thread pools to handle blocking business operations. But what is the optimal size of such a thread pool? We may run into two problems:

  • The thread pool is too small: we don't have enough threads to cover the time when all of them are blocked, say, waiting on I/O operations, and the CPU is not used efficiently.
  • The thread pool is too large: we pay the price of many threads that are mostly idle (see the cost of running many threads below).

Brian Goetz's book Java Concurrency in Practice says that sizing a thread pool is not an exact science; it is more about understanding your environment and the nature of the tasks:

  • How many CPUs and how much memory does your environment have?
  • Does the task perform primarily computation, I/O, or some combination?
  • Do they require scarce resources (e.g., JDBC connections)? Thread pools and connection pools affect each other; it may not make sense to grow the thread pool for better throughput when the connection pool is already fully utilized.

If our program contains I/O or other blocking operations, we need a larger pool, because threads cannot stay on the CPU all the time. Use a profiler or benchmark to estimate the ratio of wait time to compute time per task, and observe CPU utilization across different production workloads (peak time vs. off-peak time).
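
Goetz's book gives a rough heuristic for this: N_threads = N_cpu * U_cpu * (1 + wait time / compute time). A minimal sketch, where the target utilization and the wait/compute times are illustrative assumptions you would replace with measured values:

    // Sizing heuristic from Java Concurrency in Practice:
    //   N_threads = N_cpu * U_cpu * (1 + W / C)
    int nCpu = Runtime.getRuntime().availableProcessors();
    double targetUtilization = 0.8; // U_cpu: assumed target CPU utilization
    double waitTime = 50.0;         // W: assumed ms a task spends blocked (measure this)
    double computeTime = 5.0;       // C: assumed ms a task spends on the CPU (measure this)
    int poolSize = (int) (nCpu * targetUtilization * (1 + waitTime / computeTime));
    // e.g., with 8 CPUs: 8 * 0.8 * (1 + 10) = 70 threads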

Non-blocking I/O with non-blocking processing

Server based on the same number of threads as CPU cores

This strategy is most effective if we can manage most of the workload in a non-blocking manner. This means that handling sockets (accepting connections, reading, writing) is implemented using non-blocking algorithms, but even business processing does not contain any blocking operations.

The poster child for this strategy is the Netty framework, so let’s take a deep dive into the architectural foundations of how this framework is implemented to understand why it’s best suited for solving the C10K problem. If you want to learn more about how it works, then I can recommend the following resources:

Netty in Action - by Norman Maurer, one of the core developers of the Netty framework. It is a valuable resource for understanding how to implement clients and servers with Netty, using handlers for various protocols.

I/O library with asynchronous programming model

Netty is an I/O library and framework that simplifies non-blocking I/O programming and provides an asynchronous programming model for events occurring during the server lifecycle and on incoming connections. We just hook our lambdas into the callbacks and get all of this for free.

Many protocols come bundled, so they can be used without relying on another large library.

Starting to build an application with pure JDK NIO is very frustrating; Netty keeps the programmer at a low level while providing the means to make many things more efficient. Netty already includes codecs for most of the well-known protocols, which means we can work with them more efficiently than through the boilerplate of higher-level libraries (such as Jersey/Spring MVC for HTTP/REST).

Identify the right non-blocking use cases to fully exploit Netty's capabilities

I/O handling, protocol implementations, and all other handlers should use non-blocking operations so that the current thread is never stopped. We can always use an additional thread pool for blocking operations (a minimal off-loading sketch follows). However, if the processing of every request has to switch to a dedicated thread pool to perform a blocking operation, we have barely used Netty's power: we would most likely end up in the same situation as non-blocking I/O with blocking processing, i.e., one big thread pool that merely lives in a different part of the application.
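
When only one handler truly needs to block, Netty lets us run just that handler on a separate executor instead of the event loop. A minimal sketch, assuming Netty 4.x; BlockingBusinessHandler is a hypothetical handler and the pool size is an illustrative assumption:

    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.util.concurrent.DefaultEventExecutorGroup;
    import io.netty.util.concurrent.EventExecutorGroup;

    class OffloadingInitializer extends ChannelInitializer<SocketChannel> {
        // Shared executor for blocking handlers; 16 threads is an assumed size.
        private static final EventExecutorGroup BLOCKING_GROUP = new DefaultEventExecutorGroup(16);

        @Override
        protected void initChannel(SocketChannel ch) {
            // Handlers registered with an EventExecutorGroup run on that group's threads,
            // so a blocking call in BlockingBusinessHandler (hypothetical) no longer
            // stalls this channel's event loop.
            ch.pipeline().addLast(BLOCKING_GROUP, "business", new BlockingBusinessHandler());
        }
    }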

The main components of the Netty architecture are the following.

EventLoopGroup - groups several event loops and registers each channel with one of them.

Event loop - handles all I/O operations for the channels registered with it. An EventLoop runs on exactly one thread, so the optimal number of event loops per EventLoopGroup is the number of CPUs (some frameworks use the number of CPUs + 1 to have a spare thread in case of page faults).

Pipeline - maintains the execution order of the handlers (ordered components that are executed on inbound or outbound events and contain the actual business logic). The pipeline and its handlers are executed on the thread belonging to the EventLoop, so a blocking operation in a handler blocks all other processing/channels on that EventLoop.
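
Putting these components together, here is a minimal sketch of an echo server, assuming Netty 4.x (the port and the inline handler are illustrative):

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;

    public class NettyEchoServer {
        public static void main(String[] args) throws InterruptedException {
            EventLoopGroup bossGroup = new NioEventLoopGroup(1);  // accepts connections
            EventLoopGroup workerGroup = new NioEventLoopGroup(); // I/O event loops (defaults to 2 * CPUs)
            try {
                ServerBootstrap bootstrap = new ServerBootstrap()
                        .group(bossGroup, workerGroup)
                        .channel(NioServerSocketChannel.class)
                        .childHandler(new ChannelInitializer<SocketChannel>() {
                            @Override
                            protected void initChannel(SocketChannel ch) {
                                // the pipeline orders the handlers run on this channel's event loop
                                ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                                    @Override
                                    public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                        ctx.writeAndFlush(msg); // echo back; must never block the event loop
                                    }
                                });
                            }
                        });
                bootstrap.bind(5050).sync().channel().closeFuture().sync();
            } finally {
                bossGroup.shutdownGracefully();
                workerGroup.shutdownGracefully();
            }
        }
    }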
