Kubernetes network load balancing using the OkHttp client

During an internal audit of our Java services, we discovered that some requests were not being load balanced properly on the Kubernetes (K8s) network. What led us to dig deeper was a sharp increase in HTTP 5xx error rates, caused by very high CPU usage, a large number of garbage collection events, and timeouts, but only on a few specific Pods.

The problem does not show up in every case: it affects multi-Pod services where the number of source Pods differs from the number of target Pods. In this blog post, I describe the measures we took to balance load across these services and Pods.

How are requests balanced across the Pods in our deployment?

Two source Pods send requests to six target Pods.

The request distribution between the target Pods is clearly uneven.

But why is this happening?

In IPVS proxy mode, the K8s load balancer uses round robin as its default scheduler, and IPVS offers several other schedulers for balancing traffic to Pod backends. When we tested these options, the behavior of our services, which communicate with each other over internal routing, was the same regardless of the configuration.

What exactly happened? IPVS in K8s balances traffic per connection, which works well most of the time. Our services use OkHttp as the HTTP client to communicate with each other, and our problem is related to how this client behaves. With the default configuration, it opens a connection to the server and, because opening connections is expensive, keeps that connection alive and reuses it for the same peer unless your code explicitly closes it. In other words, the client holds on to its connection to the target and sends its requests over that one connection. This typically results in a 1:1 connection between a source and a target, which K8s cannot balance.
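To make the effect concrete, here is a minimal, self-contained sketch. It is a toy model, not IPVS or OkHttp code: the class `KeepAliveSkew` and its `RoundRobin` helper are hypothetical names, the two clients and six backends mirror the setup described above, and the 1000 requests per client is an arbitrary figure chosen for illustration.

```java
import java.util.Arrays;

final class KeepAliveSkew {
    /** Toy round-robin scheduler: like IPVS, it picks a backend once per NEW connection. */
    static final class RoundRobin {
        private final int backends;
        private int next = 0;

        RoundRobin(int backends) {
            this.backends = backends;
        }

        int connect() {
            int chosen = next;
            next = (next + 1) % backends;
            return chosen;
        }
    }

    /** Each client opens a single connection and reuses it for all of its requests. */
    public static int[] simulate(int clients, int backends, int requestsPerClient) {
        RoundRobin scheduler = new RoundRobin(backends);
        int[] hits = new int[backends];
        for (int c = 0; c < clients; c++) {
            int backend = scheduler.connect();   // the only balancing decision for this client
            hits[backend] += requestsPerClient;  // every later request follows the same connection
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two source Pods, six target Pods, 1000 requests each (illustrative numbers).
        System.out.println(Arrays.toString(simulate(2, 6, 1000)));
        // prints [1000, 1000, 0, 0, 0, 0]
    }
}
```

In this model the scheduler is asked only twice, so four of the six backends never receive a request, whichever scheduling algorithm is chosen.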

What to do?

If you need to scale, or want your service to be load balanced properly, you need to update the configuration on the client side. OkHttp provides a ConnectionPool for this. With a suitably configured ConnectionPool, a connection is kept only for a limited period of time, after which a new one is established. IPVS can then balance the load, because it keeps receiving new connections to route to the targets according to its scheduler. Basically, it works like a machine gun instead of a laser beam.
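The effect of regularly replacing connections can be sketched with a small toy model. This is not OkHttp or IPVS code: `ConnectionRotation` is a hypothetical class, and a fixed number of requests per connection stands in for the time-limited connections described above. The request counts are arbitrary and chosen so that the numbers divide evenly.

```java
import java.util.Arrays;

final class ConnectionRotation {
    /**
     * Each client replaces its connection after requestsPerConnection requests.
     * A round-robin cursor picks the backend once per new connection.
     */
    public static int[] simulate(int clients, int backends,
                                 int requestsPerClient, int requestsPerConnection) {
        int[] hits = new int[backends];
        int[] backendOf = new int[clients];
        int[] sentOnConnection = new int[clients];
        // Start "exhausted" so that every client connects before its first request.
        Arrays.fill(sentOnConnection, requestsPerConnection);
        int next = 0; // round-robin cursor, advanced once per new connection

        for (int r = 0; r < requestsPerClient; r++) {
            for (int c = 0; c < clients; c++) {
                if (sentOnConnection[c] >= requestsPerConnection) {
                    backendOf[c] = next;
                    next = (next + 1) % backends;
                    sentOnConnection[c] = 0;
                }
                hits[backendOf[c]]++;
                sentOnConnection[c]++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Connection never replaced: only two backends are used.
        System.out.println(Arrays.toString(simulate(2, 6, 1200, 1200)));
        // prints [1200, 1200, 0, 0, 0, 0]

        // Connection replaced every 100 requests: the load spreads evenly.
        System.out.println(Arrays.toString(simulate(2, 6, 1200, 100)));
        // prints [400, 400, 400, 400, 400, 400]
    }
}
```

The trade-off is visible in the parameter: a smaller `requestsPerConnection` balances better but opens more connections, which is the more expensive operation.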

How are we doing after releasing this update?

Connections between multi-Pod services are now load balanced, using the updated HTTP client configuration and the default IPVS scheduler.

What changes have been made?

We tested various configurations extensively, measuring response times and performance overhead, to make sure requests were load balanced. The major code changes below showed no noticeable performance overhead.

Example code changes

There is also an option to configure the scheduler so that more requests can be sent in parallel. In our case, this opened a batch of connections that were closed shortly afterwards, after which the client went back to using only one. We also try to avoid opening new connections too often, since executing a request is much less demanding than opening a new connection.
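The original code is not reproduced here, so the following is only a sketch of what such a client configuration might look like. It assumes the "scheduler" mentioned above corresponds to OkHttp's `Dispatcher`. The class name `BalancedHttpClient` and all numeric values are hypothetical and would need to be tuned against real traffic. Note that the `ConnectionPool` keep-alive duration applies to idle connections, and OkHttp's defaults are 5 idle connections kept for 5 minutes.

```java
import java.util.concurrent.TimeUnit;

import okhttp3.ConnectionPool;
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

public final class BalancedHttpClient {
    // Hypothetical values, for illustration only.
    private static final int MAX_IDLE_CONNECTIONS = 10;
    private static final long KEEP_ALIVE_SECONDS = 30;
    private static final int MAX_REQUESTS_PER_HOST = 20;

    private BalancedHttpClient() {
    }

    public static OkHttpClient create() {
        // Idle connections are evicted after the keep-alive duration, so the
        // client periodically has to open new connections that IPVS can route.
        ConnectionPool pool =
                new ConnectionPool(MAX_IDLE_CONNECTIONS, KEEP_ALIVE_SECONDS, TimeUnit.SECONDS);

        // The dispatcher limits how many asynchronous (enqueue) calls run in
        // parallel per host; OkHttp's default is 5.
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequestsPerHost(MAX_REQUESTS_PER_HOST);

        return new OkHttpClient.Builder()
                .connectionPool(pool)
                .dispatcher(dispatcher)
                .build();
    }
}
```

A shorter keep-alive produces more new connections for IPVS to distribute, at the cost of more connection setups, so the value is a balance between even load and connection overhead.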

What are the results?

Network and resource usage is now more balanced than before: there are no huge or long-lasting spikes, and no "noisy neighbor" effect that hits only some Pods in the deployment. Because nearly all Pods are now utilized to nearly the same degree, we were able to reduce the number of Pods in our deployment. We know this is not perfect, but it is good enough for our use case because it adds no noticeable performance overhead to the service or to the IPVS load balancer.

Load balancing of requests on current Pods

In conclusion

Auditing your services thoroughly and regularly pays off: it can reveal optimizations that benefit all services, and it saves time otherwise spent troubleshooting strange symptoms in features that should have worked out of the box. Also take the time to read the documentation, and to test, discuss, and understand the impact of a client library's default settings for connection setup and handling, so that it behaves as you expect.
