Alibaba Cloud EMR on ACK gives users a new way to build a big data platform: open source big data services are deployed on Alibaba Cloud Container Service for Kubernetes (ACK). Because ACK handles service deployment and provides high-performance, scalable container application management, users only need to focus on the big data jobs themselves. Spark, Presto, and Flink jobs can be run on the ACK cluster. EMR on ACK is 100% compatible with the open source versions and performs better than them.

1. Background

Technology Trends

- Separation of storage and compute, and evolution toward cloud native
- Online services, AI, and big data workloads all run on the same ACK cluster, with peak-shifting scheduling and co-location of offline and online workloads to improve machine utilization
- A unified operations and maintenance entry point, tool chain, and monitoring system
- A shift from cluster-centric to job-centric operation
- Multi-version support: for example, Spark 2.x and Spark 3.x can run at the same time

Challenges of Going Cloud Native

Compute and storage separation: how to build an HCFS file system on top of object storage (OSS)
- It must be fully compatible with existing HDFS
- Performance must be comparable to HDFS, at lower cost

Separating storage and compute for engine shuffle data: how to handle ACK's mixed, heterogeneous machine types
- Heterogeneous machine types do not have local disks
- The community discussion in [SPARK-25299] about supporting Spark dynamic resource allocation has become an industry consensus

ACK scheduling capabilities: how to solve scheduling performance bottlenecks
- Scheduling performance must be on par with YARN
- Multi-level queue management

Peak-shifting scheduling
- Use Kubernetes as the operating system layer to orchestrate the peaks and troughs of the different workloads

Advantages of EMR on ACK

Remote Shuffle Service provides a storage and compute separation solution for intermediate shuffle data
- Compute nodes need neither local disks nor cloud disks
- Spark dynamic resource allocation can be enabled, which addresses the problem discussed in SPARK-25299

JindoFS provides data lake acceleration for OSS storage
- In Block mode, the 1 TB TPC-DS scenario shows a performance improvement of more than 15%

The scheduling layer supports Scheduler Framework V2
- Scheduling performance is more than 3x that of the community version
- Multi-level queue management is provided

Engine capability enhancements
- In the 10 TB TPC-DS benchmark, EMR Spark is 3x faster than community Spark
- Hudi and Delta Lake are enhanced in functionality and performance compared with the community versions

A complete peak-shifting scheduling solution

2. EMR Containerized Architecture

EMR on ACK Architecture

- Lightweight management and control that connects to existing data platforms
- Jobs are submitted to different execution platforms through the data development cluster or scheduling platform
- Peak-shifting scheduling, adjusted according to business peak and off-peak strategies
- Cloud-native data lake architecture
- ACK provides strong elastic scale-out and scale-in capabilities

3. Product Introduction

Product Home

Reference link: https://www.aliyun.com/product/emapreduce

Create a New Cluster

- Region: currently available in Hangzhou, Shanghai, Beijing, Shenzhen, and other regions (more regions are being opened continuously)
- Shuffle Service: provides an optimized shuffle service for the EMR compute engines, removing the dependency on local disks under Kubernetes. It relieves the network and disk I/O bottlenecks of large-scale compute clusters, supports the compute and storage separation architecture, and can serve multiple EMR clusters.
- Presto: a memory-based distributed SQL engine for interactive queries that supports multiple data sources. It is suited to complex analysis of PB-scale data and to queries across data sources.
- Component version: Spark (3.1.1)
- ACK cluster: create a new ACK cluster and select the entire cluster as dedicated nodes
- OSS Bucket: used to store jobs, logs, JAR packages, and other information

Cluster Management

- Cluster ID/Name: click to enter job management
- Cluster status: shows whether the cluster is available
- ACK cluster: the cluster can be associated with an existing ACK cluster
- Cluster configuration: Spark job configuration
- Release: release space
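
To make the job-centric model and Spark dynamic resource allocation more concrete, here is a minimal Python sketch that assembles a `spark-submit` command line for a Kubernetes cluster such as ACK. It uses only open source Spark-on-Kubernetes options and is not necessarily how EMR on ACK submits jobs. The helper name `build_spark_submit`, the API server address, the image name, and the OSS path are all hypothetical. Because the configuration keys of the Remote Shuffle Service are not given here, the sketch falls back on open source Spark's shuffle tracking so that dynamic allocation works without an external shuffle service.

```python
from typing import Dict, List


def build_spark_submit(
    api_server: str,
    image: str,
    app_jar: str,
    main_class: str,
    namespace: str = "default",
    max_executors: int = 10,
) -> List[str]:
    """Assemble a spark-submit command for a Kubernetes cluster (hypothetical helper)."""
    conf: Dict[str, str] = {
        # Standard Spark-on-Kubernetes settings.
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        # Dynamic allocation; shuffle tracking (Spark 3.0+) lets it work
        # without an external shuffle service. Assumption: a remote shuffle
        # service would replace this setting with its own configuration.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or images.
    command = build_spark_submit(
        api_server="https://203.0.113.10:6443",
        image="registry.example.com/spark:3.1.1",
        app_jar="oss://example-bucket/jars/spark-examples.jar",
        main_class="org.apache.spark.examples.SparkPi",
    )
    print(" ".join(command))
```

Reading the JAR from an `oss://` path assumes the container image ships an OSS-compatible Hadoop file system; otherwise a `local://` path inside the image would be needed.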