With the vigorous development of cloud-native technology and its increasingly mature adoption in industry, machine learning on the cloud is rapidly advancing toward large-scale, industrialized deployment. Recently Morphling, an independent sub-project of Alibaba's open-source KubeDL, became a Cloud Native Computing Foundation (CNCF) Sandbox project. It provides automated deployment-configuration tuning, testing, and recommendation for the large-scale industrial deployment of machine learning inference services. As GPU virtualization and sharing technologies mature, Morphling helps companies fully exploit the advantages of cloud native, optimize the performance of online machine learning services, reduce deployment costs, and efficiently address the performance and cost challenges of machine learning in real industrial deployments. The academic paper behind the project, "Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving", was accepted by the ACM Symposium on Cloud Computing 2021 (ACM SoCC 2021).

Morphling is a character from the game Dota who can flexibly change form to optimize combat performance according to the environment. Through the Morphling project, we hope to achieve equally flexible and intelligent deployment-configuration changes for machine learning inference jobs, optimizing service performance while reducing deployment costs.

Background

The workflow of machine learning on the cloud can be divided into two parts: model training and model serving. After offline training, tuning, and testing, a model is deployed as an online containerized application that provides users with uninterrupted, high-quality inference services, such as object recognition in live video, online language translation, and online image classification. For example, the Machine Vision Application Platform (MVAP) of Alibaba's Taobao content social platform supports Taobao Live product highlight recognition, live-stream cover image deduplication, and image and text classification through an online machine learning inference engine.

According to Intel, the era of "Inference at Scale" is coming: by 2020, the ratio of inference to training cycles was expected to exceed 5:1. Amazon data shows that in 2019, AWS's infrastructure spending on model inference accounted for more than 90% of its total machine learning spending. Machine learning inference has become the key to putting artificial intelligence into production and "monetizing" it.

Inference tasks on the cloud

An inference service is a special kind of long-running microservice. As the volume of inference services deployed on the cloud grows, cost and service performance become critical optimization targets. This requires the operations team to properly optimize the configuration of the inference container before deployment, including hardware resource configuration and service runtime parameters. These configurations play a vital role in balancing service performance (such as response time and throughput) against resource efficiency. In practice, our tests found that different deployment configurations can lead to up to a tenfold difference in throughput per unit of resource.
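To make these configuration dimensions concrete, the sketch below enumerates a hypothetical search space of the kind discussed later in this article (CPU cores, GPU memory, batch size, GPU model). The dimension names and candidate values are illustrative assumptions, not Morphling defaults; the point is only how quickly the number of combinations grows.

```python
from itertools import product

# Hypothetical configuration dimensions for an inference container; the
# candidate values are illustrative assumptions, not Morphling defaults.
search_space = {
    "cpu_cores":      [1, 2, 4, 8, 16],           # CPU cores requested by the container
    "gpu_memory_mib": [1024, 2048, 4096, 8192],   # GPU memory slice on a shared (virtualized) GPU
    "batch_size":     [1, 4, 8, 16, 32, 64],      # serving batch size
    "gpu_model":      ["T4", "P100", "V100"],     # GPU card type
}

combinations = list(product(*search_space.values()))
print(f"{len(combinations)} candidate configurations to profile")  # grows multiplicatively per dimension
```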
Drawing on Alibaba's extensive experience with AI inference services, we first characterized the inference workload. Compared with traditional service deployment, it has the following characteristic:

Expensive GPU resources, but low GPU memory usage: the maturing of GPU virtualization and time-sharing technologies gives us the opportunity to run multiple inference services on a single GPU at the same time, significantly reducing cost. Unlike a training task, an inference task uses an already-trained neural network model to process user input and produce output. The process involves only the forward pass of the network, so its demand for GPU memory is low. In contrast, training involves backpropagation and must store a large number of intermediate results, putting much more pressure on GPU memory. Our cluster data shows that allocating an entire GPU to a single inference task wastes a considerable amount of resources. How to choose the right GPU resource specification for an inference service, especially the incompressible GPU memory, therefore becomes a key problem.

Optimizing the inference service deployment configuration

Cloud-native technology, with Kubernetes as the mainstream, is being widely applied to new kinds of application workloads. Building machine learning tasks (both training and inference) on Kubernetes and achieving stable, efficient, and low-cost deployment has become the focus of major companies promoting AI projects and cloud services. The industry is still exploring how to configure inference containers under the Kubernetes framework. The most common approach is to set parameters manually based on experience, which is simple but inefficient. In practice, service deployers, viewing the problem from a cluster manager's perspective, tend to over-provision resources to guarantee service quality, sacrificing efficiency in exchange for stability and wasting a great deal of resources; or they simply keep the default runtime parameters and give up opportunities for performance optimization.

Drawing again on this experience, we found the main pain point of inference-service configuration tuning to be the lack of a framework for automated performance testing and parameter tuning: iterating manually between adjusting the configuration and stress-testing the service imposes an enormous manual burden, making systematic tuning an impractical option in reality.

Morphling

To address these challenges, Alibaba's cloud-native cluster management team developed and open-sourced Morphling, a Kubernetes-based configuration-tuning framework for machine learning inference services. It automates the entire process of testing parameter combinations and pairs it with efficient, intelligent tuning algorithms, so that configuration tuning for inference workloads runs efficiently on Kubernetes, solving the performance and cost challenges of machine learning in real industrial deployments.
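Conceptually, the workflow that Morphling automates can be sketched as follows. This is a minimal, framework-agnostic illustration rather than Morphling's actual code: deploy_inference_service, run_stress_test, teardown, and RandomSampler are simulated placeholders for the container operations and sampling algorithms the framework provides.

```python
import random

# Placeholders for the container/Kubernetes operations a tuning framework
# performs; here they are simulated so the sketch runs end to end.
def deploy_inference_service(config):
    return {"config": config}           # pretend we launched a serving container

def run_stress_test(endpoint):
    return random.uniform(10, 100)      # pretend this is the measured throughput (RPS)

def teardown(endpoint):
    pass

class RandomSampler:
    """Stand-in for a real tuning algorithm (grid search, Bayesian optimization, meta-model)."""
    def suggest(self, search_space):
        return {k: random.choice(v) for k, v in search_space.items()}
    def observe(self, config, rps):
        pass                            # a real sampler would update its model here

def tune(search_space, sampler, max_trials):
    best_config, best_rps = None, float("-inf")
    for _ in range(max_trials):
        config = sampler.suggest(search_space)       # pick the next configuration to try
        endpoint = deploy_inference_service(config)  # launch the serving container with it
        rps = run_stress_test(endpoint)              # stress-test the endpoint and record throughput
        teardown(endpoint)
        sampler.observe(config, rps)                 # feed the result back to the algorithm
        if rps > best_rps:
            best_config, best_rps = config, rps
    return best_config, best_rps

space = {"cpu_cores": [1, 2, 4, 8], "gpu_memory_mib": [2048, 4096, 8192], "batch_size": [1, 8, 32]}
print(tune(space, RandomSampler(), max_trials=10))
```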
Morphling abstracts the parameter-tuning process into different levels of cloud-native abstraction, exposing a simple and flexible configuration interface to users while encapsulating the underlying container operations, data communication, sampling algorithms, and storage management in its controllers. Specifically, Morphling's tuning and stress-testing process uses an Experiment-Trial workflow. The Experiment is the user-facing abstraction: through it, users specify the storage location of the machine learning model, the configuration parameters to be tuned, the upper limit on the number of tests, and so on, thereby defining a specific tuning job. Through iterative sampling and testing, an optimized configuration combination is finally recommended to the deployment team. Morphling also provides a management suite, Morphling-UI, which lets a deployment team launch tuning experiments, monitor the tuning process, and compare tuning results through simple operations on a web interface.

The practice of Morphling in the Taobao content social platform

Alibaba's rich set of online inference scenarios and its large number of inference service instances provided first-hand practice and test feedback for validating Morphling. Among them, the Machine Vision Application Platform (MVAP) team of Alibaba's Taobao content social platform supports Taobao Live product highlight recognition, live-stream cover image deduplication, and image and text classification through online machine learning inference engines. During the 2020 Double Eleven shopping festival, we used Morphling to profile and optimize the specifications of AI inference containers, looking for the best trade-off between performance and cost. At the same time, the algorithm engineering team performed targeted model quantization and analysis on the most resource-intensive inference models, such as the Taobao video viewing service, optimizing them from the perspective of model design. Together, this supported the Double Eleven peak traffic with minimal resources and no degradation in business performance, greatly improving GPU utilization and reducing cost.

Academic Exploration

To further improve the efficiency of tuning inference-service parameters, the Alibaba cloud-native cluster management team, starting from the characteristics of inference workloads, explored meta-learning and few-shot regression to build a configuration-tuning algorithm with higher efficiency and lower sampling cost, meeting the industry's demand for "fast, small-sample, low-cost" tuning, together with a cloud-native, automated tuning framework. The related academic paper, "Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving", was accepted by the ACM Symposium on Cloud Computing 2021 (ACM SoCC 2021). In recent years, the optimized deployment of AI inference tasks on the cloud has been an active and popular topic in major academic journals and conferences on cloud computing and systems.
The topics explored include dynamic selection of AI models, dynamic scaling of deployment instances, traffic scheduling for user requests, and full utilization of GPU resources (such as dynamic model loading and batch-size optimization). Morphling, however, is the first work to study container-level configuration optimization for inference services based on large-scale industry practice.

On the algorithm side, performance tuning is a classic hyper-parameter tuning problem, but traditional methods such as Bayesian optimization struggle with high dimensionality (many configuration items) and large search spaces. For AI inference tasks, we tune a combination of four configuration items: number of CPU cores, GPU memory size, batch size, and GPU model. Each item has 5 to 8 candidate values, so the combined search space contains more than 700 configurations. Based on our experience in production test clusters, each test of one configuration of an AI inference container takes several minutes, from launching the service through stress testing to reporting the data. At the same time, inference services are diverse and frequently updated, while deployment engineers and test-cluster budgets are limited. Efficiently finding near-optimal configurations in such a large search space therefore poses a new challenge to hyper-parameter tuning algorithms.

Our core observation in the paper is that across different AI inference services, the way the configurations to be tuned (such as GPU memory and batch size) affect container performance (such as RPS) is stable and similar, which becomes apparent when the "configuration-performance" relationship is visualized as a surface. For different inference services, the shapes of these surfaces are similar; what differs numerically is how strongly each configuration affects performance and where the inflection points lie. For example, visualizing the effect of the two-dimensional configuration <CPU cores, GPU memory size> on container throughput (RPS) for three inference models yields surfaces of very similar shape. The paper therefore proposes to use Model-Agnostic Meta-Learning (MAML) to learn these commonalities ahead of time and train a meta-model, so that for a new inference service the key points of the surface can be found quickly and an accurate fit can be made from a small sample (about 5%) starting from the meta-model (a toy sketch of this small-sample adaptation idea appears at the end of this article).

Summary

Morphling is a Kubernetes-based configuration framework for machine learning inference services. Combined with a "fast, small-sample, low-cost" tuning algorithm, it delivers an automated, stable, and efficient cloud-native tuning process for AI inference deployment, enabling faster optimization and iteration of the deployment process and accelerating the launch of machine learning applications. The combination of Morphling and KubeDL will also make the AI experience from model training to inference deployment, including configuration tuning, smoother.
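To close, here is a toy illustration of the small-sample adaptation idea from the Academic Exploration section. It deliberately replaces the paper's MAML-trained neural meta-model with a simple linear analogue: a shared initialization is derived from the configuration-performance data of already-profiled services and then adapted to a new service using only a handful of measurements. All function names and data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_linear(X, y, w0, steps=100, lr=0.01):
    """Gradient-descent linear regression starting from the initialization w0."""
    w = w0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def meta_initialize(tasks):
    """Derive a shared start point from the (configuration, RPS) data of profiled services."""
    d = tasks[0][0].shape[1]
    return np.mean([fit_linear(X, y, np.zeros(d)) for X, y in tasks], axis=0)

def adapt(w_meta, X_few, y_few, steps=10):
    """Few-shot adaptation: refine the shared start point with a handful of new measurements."""
    return fit_linear(X_few, y_few, w_meta, steps=steps)

# Example with synthetic data: three "profiled" services, then a new service
# for which only a few configurations (roughly 5% of its space) are measured.
rng = np.random.default_rng(0)
tasks = [(rng.random((40, 3)), rng.random(40)) for _ in range(3)]
w_meta = meta_initialize(tasks)
X_new, y_new = rng.random((5, 3)), rng.random(5)
w_new = adapt(w_meta, X_new, y_new)
print(w_new)
```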