Practice of Optimizing the Forwarding Performance of VUA, vivo's Unified Access Gateway

VLB stands for vivo Load Balance. As the IDC traffic entrance for vivo's Internet services, VLB carries the public-network traffic of many important services. This article explores HTTPS performance optimization of VUA, VLB's Layer 7 load balancer, to achieve the best forwarding performance.

1. Overall architecture of vivo VLB

▲ Figure 1 vivo VLB overall architecture

The core of the overall VLB architecture comprises a DPDK-based Layer 4 load balancer (VGW), a Layer 7 load balancer (VUA) built on Apache APISIX with extended NGINX functionality, and a unified management and operations platform.

Its main features are:

  • High performance: supports tens of millions of concurrent connections and millions of new connections per second.
  • High availability: ECMP, health checks, and other mechanisms provide multi-level high availability, from the load balancer itself down to the backend business servers.
  • Scalability: supports horizontal elastic scaling of Layer 4/Layer 7 load clusters and business servers, as well as grayscale release.
  • Layer 4 load balancing capabilities: VIPs are announced to the switches via BGP; balancing algorithms such as round-robin, weighted round-robin, weighted least connections, and consistent hashing are supported; the FullNAT forwarding mode simplifies deployment; etc.
  • Layer 7 load balancing capabilities: supports forwarding rules configured by domain name and URL; supports balancing algorithms such as round-robin and weighted round-robin.
  • SSL/TLS capabilities: management and configuration of certificates, private keys, and handshake policies; SNI support; SSL offload hardware acceleration based on a variety of accelerator cards; etc.
  • Traffic protection and control: provides a degree of SYN flood protection, as well as network traffic control measures such as QoS rate limiting and ACL access control.
  • Management and control platform: supports multi-dimensional configuration, monitoring, and alerting for network and business metrics.

This article gives an overview of two approaches to optimizing the SSL/TLS performance of VUA, VLB's Layer 7 load balancer:

  • QAT_HW based on hardware technology
  • QAT_SW based on instruction set optimization

2. VUA Layer 7 Load Balancing

2.1 Introduction to VUA

At present, the biggest pain points of the company's access layer are dynamic upstreams, dynamic routing, dynamic certificates, traffic grayscale, blacklists and whitelists, dynamic scheduling, log query and tracing, and so on. To support the sustainable development of the company's business, and in particular the full containerization of services, it is urgent to build a unified access platform that consolidates the existing online NGINX clusters and Ingress NGINX, carries the company's web, mobile, partner, internal-system, and IoT device traffic, matches the industry's access-layer capabilities, and ensures the smooth development of the business.

VUA stands for vivo Unified Access.

The vivo unified access layer is developed as an extension of Apache APISIX 2.4.

2.2 VUA Architecture

▲ Figure 2 APISIX architecture (Image source: Github-apache/apisix)

  • Apache APISIX: built on OpenResty 1.19.3.1 and Lua (the component itself is stateless).
  • Manager-api: developed in Go; used for configuration management and changes.
  • APISIX-Ingress-Controller: built on the native Kubernetes controller mechanism, with a multi-replica leader-election hot-standby mechanism. It watches the Kubernetes API server and reports pod information to the Manager-api.
  • Etcd: stores routing, upstream, and other configuration information.

▲ Figure 3 VUA architecture

3. QAT Acceleration Technology

The Intel QuickAssist Technology OpenSSL Engine (QAT_Engine) supports both hardware acceleration and optimized software based on vectorized instructions. The software path was introduced with 3rd Generation Intel® Xeon® Scalable processors, giving users more options for accelerating their workloads.

3.1 Asynchronous Architecture

Building on NGINX's native asynchronous processing framework, VUA extends the asynchronous event handling mechanism to asynchronous hardware engines. The overall interaction flow is as follows (a minimal code sketch is shown after the list):

  • ASYNC_start_job: NGINX calls the SSL library interface SSL_do_handshake to start an asynchronous task.
  • The task performs RSA/ECDH encryption and decryption operations.
  • The QAT engine submits the crypto request to the driver, creates an asynchronous event listener fd, and binds the fd to the asynchronous task's context.
  • qat_pause_job: this interface saves the stack of the executing asynchronous task and temporarily suspends it while the hardware crypto operation completes. The stack then switches back to the NGINX I/O event loop, SSL returns WANT_ASYNC, and NGINX goes on to process other pending events.
  • The NGINX I/O processing framework retrieves the async fd saved in the asynchronous task context and adds it to the epoll queue to start listening.
  • When the accelerator card completes the task, the QAT engine calls the qat_wake_job interface to wake up the task (that is, it marks the async fd as readable). QAT provides NGINX with several methods for polling the accelerator card's response queue; VUA currently uses heuristic polling, whose parameters can be set in the configuration file.
  • NGINX processes the asynchronous event and calls the asynchronous task framework's ASYNC_start_job interface again. At this point the program switches context and the stack jumps back to where the job was previously paused.
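From the application side, the same flow maps onto OpenSSL's asynchronous API. The sketch below is a minimal illustration, not VUA's actual NGINX integration: the function name drive_async_handshake, the fd array size, and the epoll handling are assumptions made for the example; only the OpenSSL calls (SSL_set_mode, SSL_do_handshake, SSL_get_all_async_fds) and the WANT_ASYNC flow correspond to the mechanism described above.

```c
/*
 * Minimal sketch (not VUA's production code) of driving an OpenSSL
 * asynchronous handshake, matching the flow described above.
 * Assumes the QAT engine has already been loaded into OpenSSL.
 */
#include <sys/epoll.h>
#include <openssl/ssl.h>
#include <openssl/async.h>

static int drive_async_handshake(SSL *ssl, int epoll_fd)
{
    /* Allow SSL_do_handshake() to yield while the engine works. */
    SSL_set_mode(ssl, SSL_MODE_ASYNC);

    int ret = SSL_do_handshake(ssl);
    if (ret == 1)
        return 0;                                /* handshake completed */

    if (SSL_get_error(ssl, ret) == SSL_ERROR_WANT_ASYNC) {
        /* The job has been paused (qat_pause_job): fetch the async fd(s)
         * the engine registered and watch them with epoll. */
        size_t numfds = 0;
        SSL_get_all_async_fds(ssl, NULL, &numfds);   /* query the count */

        OSSL_ASYNC_FD fds[8];                        /* 8 is an arbitrary cap */
        if (numfds == 0 || numfds > 8)
            return -1;
        if (!SSL_get_all_async_fds(ssl, fds, &numfds))
            return -1;

        for (size_t i = 0; i < numfds; i++) {
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
            epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fds[i], &ev);
        }
        /* When epoll reports the fd readable (qat_wake_job has fired),
         * the caller simply re-enters SSL_do_handshake(); OpenSSL resumes
         * the paused job exactly where it yielded. */
        return 1;                                /* handshake in progress */
    }
    return -1;                                   /* genuine error */
}
```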

3.2 Overview of QAT Component Architecture

  • Application
    The application layer consists of two main parts:
    (1) the QAT asynchronous framework patch, which adds support for asynchronous mode;
    (2) the QAT engine. An engine is OpenSSL's own mechanism for abstracting different implementations of cryptographic algorithms; Intel open-sources the QAT engine specifically to support QAT acceleration (a loading sketch follows this list).
  • SAL (service access layer)
    The service access layer provides accelerator card access services to upper-layer applications. QAT currently offers two services, crypto and compression, each independent of the other. The access layer wraps a set of practical interfaces, including creating instances, initializing message queues, and sending and receiving requests.
  • ADF (acceleration driver framework)
    The accelerator card driver framework provides the driver support required by SAL, including intel_qat.ko, the 8950 PCI driver, the usdm memory management driver, and so on.
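As an illustration of the engine mechanism mentioned under "Application" above, the sketch below shows how an OpenSSL-based application can load and register the QAT engine by id. This is a generic example, not vivo's code: it assumes the engine id "qatengine" (the id used by Intel's QAT_Engine), while NGINX itself normally loads the engine through configuration (the ssl_engine directive or an openssl.cnf engine section) rather than through calls like these.

```c
/*
 * Illustrative sketch of loading the QAT engine via the OpenSSL ENGINE
 * API.  Error handling is abbreviated; the engine id "qatengine" is an
 * assumption based on Intel's open-source QAT_Engine.
 */
#include <stdio.h>
#include <openssl/engine.h>

int load_qat_engine(void)
{
    ENGINE *e;

    ENGINE_load_builtin_engines();          /* make engines discoverable */

    e = ENGINE_by_id("qatengine");          /* structural reference */
    if (e == NULL) {
        fprintf(stderr, "qatengine not available\n");
        return -1;
    }

    if (!ENGINE_init(e)) {                  /* functional reference; brings
                                               the engine (and card) up */
        ENGINE_free(e);
        return -1;
    }

    /* Route all supported algorithms (RSA, EC, ciphers, ...) through it. */
    if (!ENGINE_set_default(e, ENGINE_METHOD_ALL)) {
        ENGINE_finish(e);
        ENGINE_free(e);
        return -1;
    }

    /* The registered defaults hold their own references, so release ours. */
    ENGINE_finish(e);
    ENGINE_free(e);
    return 0;
}
```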

3.3 QAT_HW and QAT_SW

QAT_HW is based on QAT hardware accelerator cards: through the OpenSSL engine, it uses the QAT driver linked into the qatengine.so library.

QAT_SW is based on QAT software acceleration: through the OpenSSL engine, it uses the crypto_mb and ipsec_mb libraries linked into qatengine.so. It builds on the Intel AVX-512 integer fused multiply-add (IFMA) multi-buffer library. When qatengine is built with qat_sw support, operations are collected in batch queues and up to eight requests at a time are submitted, via OpenSSL's asynchronous infrastructure, to the crypto multi-buffer API, which processes them in parallel using AVX-512 vector instructions. Intel® QAT software acceleration mainly targets asymmetric PKE and AES-GCM: RSA with key sizes of 2048, 3072, and 4096 bits, plus AES128-GCM, AES192-GCM, and AES256-GCM.

If the platform supports both QAT_HW and QAT_SW, the default is to use QAT hardware acceleration for asymmetric algorithms and symmetric chained ciphers, and QAT software acceleration for symmetric GCM ciphers. If the platform has no QAT hardware support, the asymmetric algorithms supported by qatengine are accelerated with QAT_SW.

The following diagram illustrates the high-level software architecture of QAT_Engine. Applications such as NGINX and HAProxy are typical applications that interface with OpenSSL. OpenSSL is a toolkit for the TLS/SSL protocols and, starting with version 1.1.0, provides a modular system for plugging in device-specific engines. As described above, QAT_Engine contains two independent internal paths (QAT_HW and QAT_SW) through which acceleration can be performed.

▲ QAT_Engine high-level software architecture (Image source: Github-intel/QAT_Engine)

4. Comparison of performance improvements of optimization solutions

4.1 QAT_HW

This solution uses the Intel 8970 accelerator card for testing and uses RSA certificates for HTTPS encryption and decryption.

(1) Test method

The execution machine runs the VUA build adapted to the QAT engine, and the traffic-generation machine performs the stress test and injects traffic. Once CPU load reaches 100%, the new-connection QPS of the QAT-optimized VUA is recorded and compared with the baseline.

(2) Test scenario

(3) Comparison of local test data

Performance comparison using QAT accelerator card

In the QAT accelerator card optimization solution, HTTPS traffic is served through VUA and compared with the scenario that uses OpenSSL software encryption and decryption:

  • With the QAT accelerator card, the average RSA QPS increases by 1.27x with the same number of workers.
  • As the number of processes grows, the QAT accelerator card reaches a bottleneck and the gain levels off. With 56 workers, the maximum QPS reaches 44,000.

The performance improvement brought by this optimization solution mainly depends on:

  • QAT uses a user-space driver, achieving zero copy between kernel-space and user-space memory.
  • VUA calls the OpenSSL API in asynchronous mode instead of the traditional synchronous mode.
  • The QAT driver supports offload acceleration across multiple accelerator cards simultaneously.

4.2 QAT_SW

This solution is tested on an Ice Lake Xeon 6330 (which supports the AVX-512 instruction set) and uses RSA certificates for HTTPS encryption and decryption.

(1) Test method

The execution machine runs the VUA build adapted to the instruction-set optimization, and the traffic-generation machine performs the stress test and injects traffic. Once CPU load reaches 100%, the new-connection QPS of the instruction-set-optimized VUA is recorded and compared with the baseline.

(2) Test network

(3) Comparison of local test data

Performance comparison using instruction set optimization

In the instruction-set optimization solution, HTTPS traffic is served through VUA and compared with the scenario that uses OpenSSL software encryption and decryption:

  • With instruction-set optimization, the average RSA QPS doubles (a 1x increase) with the same number of workers.
  • As the number of processes grows, the gain from instruction-set optimization scales linearly, reaching up to 51,000 QPS with 56 workers.

The performance improvement brought by this optimization solution mainly depends on:

  • AVX-512 instructions are used to optimize encryption and decryption operations.

5. Summary and Thoughts

To date, in the area of software and hardware acceleration, vivo VLB supports both Exar accelerator cards and Intel QAT hardware and software (instruction-set) acceleration, achieving independent control over core network components and laying a solid foundation for a high-performance gateway architecture that can empower the industry.

Going forward, vivo VLB will continue to build out its access-layer gateway capabilities.

  • Security and compliance
    As vivo's unified traffic entrance, VLB will continue to build a secure and reliable communications infrastructure and a comprehensive security protection system.
  • Multi-protocol support
    VLB will continue to invest in efficient access capabilities and will improve the user experience under weak network conditions by introducing the QUIC protocol.
    The MQTT protocol will allow new devices and protocols to be connected at very low access cost, actively embracing the Internet of Everything.
