On June 23, 2021, the Cloud Native Computing Foundation (CNCF) announced that it had accepted KubeDL as a CNCF Sandbox project through a global TOC vote. KubeDL is Alibaba's open-source, Kubernetes-based AI workload management framework; the name abbreviates "Kubernetes Deep Learning". The project aims to feed the experience of scheduling and managing machine learning jobs at Alibaba's scale back to the community.

Project address: http://kubedl.io

Project Introduction

With mainstream AI frameworks such as TensorFlow, PyTorch, and XGBoost steadily maturing, and heterogeneous AI accelerators represented by GPUs and TPUs emerging rapidly, artificial intelligence is entering a stage of "large-scale industrialization". Between an algorithm engineer designing the first layer of a neural network and the final online service for a real application scenario lies a great deal of infrastructure-level system work beyond the AI algorithm itself: data collection and cleaning, distributed training engines, resource scheduling and orchestration, model management, inference service tuning, observability, and more. The collaboration of many such system components constitutes a complete machine learning pipeline.

At the same time, cloud-native technologies represented by Kubernetes are booming. Through clean abstractions and strong extensibility, they decouple the application layer from the IaaS (Infrastructure as a Service) layer: applications consume resources on demand in the "cloud" paradigm without having to manage the complexity of the underlying infrastructure, freeing teams to focus on innovation in their own domains.
The emergence of Kubernetes solved the problem of delivering cloud resources efficiently, but it still offers no good native support for highly complex workloads such as AI. The industry is still exploring how to bridge the differences between frameworks while preserving their generality, and how to build a complete ecosystem of tools around the runtime of AI workloads. In practice, we have found that AI workloads face the following challenges when running in the Kubernetes ecosystem:

1) There are many machine learning frameworks, each with its own optimization goals and applicable scenarios, yet they share much in the lifecycle management of distributed training jobs and have the same demands for advanced features (network modes, separation of image and code, metadata persistence, cache acceleration, etc.). Implementing a separate operator for each framework yields independent processes that cannot share state; the lack of a global view makes job-level scheduling and queueing hard to implement, hinders abstraction and reuse of common functionality, and duplicates work at the code level.

2) Inference services bring their own operational demands: algorithm scientists usually deploy two or more versions of a model at the same time for A/B testing to determine which serves best, which requires fine-grained, weight-based traffic control; and instances should scale automatically based on traffic levels and the current inference service's metrics, minimizing resource cost while fully ensuring availability.
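The two inference-service demands above can be conveyed with a minimal sketch: weight-based routing across coexisting model versions, and metric-driven replica sizing. Both functions are illustrative assumptions for exposition, not KubeDL's actual API.

```python
import math
import random

def pick_predictor(weights, rng=random):
    """1) Weight-based A/B routing: pick a model version for one request.
    weights: predictor name -> integer weight (illustrative names)."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def desired_replicas(current_qps, target_qps_per_replica,
                     min_replicas=1, max_replicas=10):
    """2) Metric-driven scaling: size the service so each replica
    handles roughly target_qps_per_replica, within bounds."""
    want = math.ceil(current_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, want))

# Route 10,000 simulated requests across a 90/10 split.
rng = random.Random(0)
counts = {"model-v1": 0, "model-v2": 0}
for _ in range(10_000):
    counts[pick_predictor({"model-v1": 90, "model-v2": 10}, rng)] += 1
print(counts)                      # roughly a 90/10 split
print(desired_replicas(950, 200))  # 5
```

A real controller would additionally smooth the metric signal and rate-limit scale-downs, but the core arithmetic is the same.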
KubeDL

To address these challenges, the Alibaba Cloud Native, Cluster Management, and PAI teams have consolidated their experience managing large-scale machine learning workloads into KubeDL, a general runtime management framework that covers all stages of the machine learning pipeline, including distributed training, model management, and inference services, enabling these workloads to run efficiently on Kubernetes.

1. Distributed training

KubeDL supports the mainstream distributed training frameworks (TensorFlow, PyTorch, MPI, XGBoost, Mars, etc.). Mars is an open-source, tensor-based large-scale data computing framework from Alibaba's computing platform; it distributes and accelerates data processing written against libraries such as NumPy and pandas, and KubeDL helps Mars jobs integrate natively into the cloud-native big data ecosystem.

We abstract the common parts of training-job lifecycle management into a shared runtime library that each distributed training job controller reuses; users can also build customized workload controllers on top of it and reuse the existing capabilities. With declarative APIs and the Kubernetes network/storage model, KubeDL handles requesting and reclaiming compute resources, service discovery and communication between Job Roles, and runtime fail-over. An algorithm developer only needs to declare the Job Roles the training depends on, the number of replicas per role, and the compute/heterogeneous resources required, then submit the task.

We have also designed many features to improve training efficiency and experience in response to pain points in the field. Different training frameworks define different Job Roles, such as PS/Chief/Worker in TensorFlow and Master/Worker in PyTorch, and there are often implicit dependencies between Roles.
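Implicit role dependencies of this kind can be resolved by topologically sorting a role-dependency DAG. The following is a minimal sketch of that ordering step, not KubeDL's actual controller code; the role names and edges are illustrative:

```python
from collections import deque

def startup_order(deps):
    """Kahn's algorithm over role dependencies.
    deps maps role -> list of roles that must be started first."""
    roles = set(deps) | {r for prereqs in deps.values() for r in prereqs}
    indegree = {r: 0 for r in roles}
    dependents = {r: [] for r in roles}
    for role, prereqs in deps.items():
        for p in prereqs:
            indegree[role] += 1
            dependents[p].append(role)
    ready = deque(sorted(r for r in roles if indegree[r] == 0))
    order = []
    while ready:
        r = ready.popleft()
        order.append(r)
        for d in sorted(dependents[r]):
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(roles):
        raise ValueError("dependency cycle between roles")
    return order

# e.g. a PyTorch job: Workers wait for the Master to be running
print(startup_order({"Worker": ["Master"]}))  # ['Master', 'Worker']
```

A controller would gate each role's pod creation on its predecessors reaching a running state, rather than computing the whole order up front, but the dependency resolution is the same.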
For example, in a PyTorch job the Workers can only begin computing normally after the Master has started; a disordered startup not only leaves resources idle for long stretches but can cause the Job to fail outright. KubeDL therefore drives scheduling and orchestration with a control flow based on a DAG (Directed Acyclic Graph), which resolves the startup dependencies between Roles cleanly and can be flexibly extended.

We also provide a management suite, KubeDL-Dashboard. Users do not need to learn Kubernetes' many APIs or wrestle with kubectl commands; they can manage machine learning jobs through a simple, easy-to-use interface. Persisted metadata is consumed directly by the Dashboard, which provides job submission, job management, event/log viewing, cluster resource views, and other functions, giving machine learning users a very low barrier to their first experiments.

2. Inference service specification tuning

The maturation of GPU virtualization and time-sharing technologies makes it possible to run multiple inference services on a single GPU at the same time, significantly reducing cost. But choosing the right GPU resource specification for an inference service, especially for incompressible GPU memory, becomes a key problem. On the one hand, frequent model iterations leave algorithm engineers no time to accurately estimate each model's resource requirements, and dynamic traffic changes make such estimates even less reliable.
As a result, they tend to over-provision GPU resources, trading efficiency for stability and wasting a great deal of capacity. On the other hand, because frameworks such as TensorFlow tend to grab all idle GPU memory, cluster managers cannot reliably estimate an inference service's real requirements from its historical memory usage either.

In the KubeDL-Morphling component, we implement automatic specification tuning for inference services: through active stress testing, the service is profiled under different resource configurations, and the most suitable container specification is recommended. The profiling process is kept efficient: to avoid exhaustively sampling the specification space, Morphling uses Bayesian optimization as the core driver of its sampling algorithm and continuously refines the fitted function, producing near-optimal specification recommendations while stress-testing only a small fraction (under 20%) of the candidate configurations.

3. Model management and inference services

A model is the product of training, the distilled combination of computation and algorithm. Models are usually collected and maintained by hosting them on cloud storage and organizing the file system for unified management. That approach depends on strict process discipline and permission control; it does not make models immutable at the system level. Container images, by contrast, were created to solve exactly the problems of RootFS construction, distribution, and immutability, so KubeDL combines the two to achieve image-based model management: once training completes successfully, an image build is automatically triggered according to the ModelVersion specified in the Job Spec.
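Image-based model management of this kind amounts to mapping each training output to an immutable image reference. A hypothetical sketch follows; the ModelVersion field names and naming scheme here are assumptions for illustration, not KubeDL's exact schema:

```python
def image_ref_for(model_version):
    """Derive the image reference a controller might build and push for a
    trained model version (illustrative fields, not KubeDL's real schema)."""
    spec = model_version["spec"]
    return f'{spec["imageRepo"]}:{spec["versionTag"]}'

# A hypothetical ModelVersion object, as the text describes: storage path,
# target registry, and version information declared by the user.
mv = {
    "kind": "ModelVersion",
    "metadata": {"name": "resnet50-20210623"},
    "spec": {
        "modelName": "resnet50",
        "storage": {"path": "/bucket/models/resnet50/ckpt"},  # assumed layout
        "imageRepo": "registry.example.com/models/resnet50",
        "versionTag": "v20210623",
    },
}
print(image_ref_for(mv))  # registry.example.com/models/resnet50:v20210623
```

Because a pushed tag (or, stricter still, its digest) never changes, every consumer of the model pulls exactly the bits that training produced, which is the immutability property file-system hosting lacks.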
In ModelVersion.Spec, users declare the model's storage path, the target image registry, and other basic information, and each training output is pushed to the corresponding image repository. The image thus serves as both the output of training and the input of the inference service, connecting the two stages into a complete machine learning pipeline: distributed training -> model building and management -> inference service deployment.

KubeDL provides an Inference resource object for deploying and controlling inference services. A complete Inference service can be composed of one or more Predictors, each corresponding to a model produced by an earlier training run; the model is automatically pulled and mounted into the main container's Volume. When Predictors for different model versions coexist, traffic is distributed according to assigned weights, enabling controlled A/B experiments. In the future, we will also explore batched inference and autoscaling for inference scenarios.

Practice of KubeDL distributed training on the public cloud: PAI-DLC

As cloud computing becomes more widespread and more business runs cloud-natively, the Alibaba Cloud Computing Platform PAI machine learning team has launched DLC (Deep Learning Cloud), a deep learning platform product. DLC adopts a cloud-native architecture, with Kubernetes as the underlying resource base and KubeDL used throughout for training management; it is a large-scale practice of KubeDL in deep learning cloud computing scenarios. Within Alibaba Group, DLC supports the deep learning computing needs of many business units, including image and video, natural language, speech, multimodal understanding, and autonomous driving teams across Taobao, the Security Department, and DAMO Academy.
In serving deep-learning-driven production at the cutting edge, the PAI team has accumulated extensive framework and platform experience: framework capabilities compatible with the community (e.g., TensorFlow/PyTorch) yet with distinctive, industrially proven features, such as training the M6 model with trillion-scale parameters, the industrial-grade graph neural network system Graph-Learn, and advanced resource management and reuse capabilities.

Today, PAI-DLC's capabilities are also fully available on the public cloud, offering developers and enterprises a cloud-native, one-stop deep learning training platform: a flexible, stable, easy-to-use, and high-performance machine learning training environment, with full support for multiple community frameworks as well as PAI's deeply optimized algorithm frameworks, running ultra-large-scale distributed deep learning tasks stably and efficiently while reducing cost for developers and enterprises.

As an embodiment of the best practices of Alibaba Group's machine learning platforms, the public cloud DLC incorporates hard-won engineering experience in product details, framework optimization, and platform services. DLC was also designed from the outset for the particular attributes of the public cloud, providing features such as spot (bidding) instances, automatic fail-over, and elastic scaling to reduce customers' AI computing costs. Furthermore, DLC integrates with PAI's other public cloud products, such as DSW for algorithm engineers' modeling, AutoML for automating the enterprise AI workflow, and EAS for online inference services, to form a benchmark AI product covering the entire process.