EMR on ACK is newly released to help enterprises efficiently build big data platforms

Alibaba Cloud EMR on ACK offers a new way to build a big data platform: users deploy open source big data services on Alibaba Cloud Container Service for Kubernetes (ACK). Because ACK handles service deployment and provides high-performance, scalable container application management, users only need to focus on the big data jobs themselves. Spark, Presto, and Flink jobs can be run on the ACK cluster. The engines are 100% compatible with their open source counterparts and perform better than them.

1. Background

Technology Trends

Separation of storage and computing, evolving toward cloud native

Online business, AI, and big data workloads are all connected to the ACK cluster, with peak-shifting scheduling and co-location of offline and online workloads to improve machine utilization

Unified operation and maintenance entry point, unified operation and maintenance toolchain, and unified monitoring system

Cluster-centric -> job-centric

Multi-version support: for example, Spark 2.x and Spark 3.x can run at the same time
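The job-centric and multi-version points can be made concrete with a small sketch. It builds argument lists using open source Spark's Kubernetes submission options, where each job names its own container image and therefore its own Spark version. How EMR on ACK itself submits jobs is not described in this article, and the API server address, registry, image tags, class names, and jar paths below are placeholders invented for illustration.

```python
from typing import List


def spark_submit_args(
    api_server: str,
    image: str,
    main_class: str,
    app_jar: str,
    namespace: str = "default",
    executors: int = 2,
) -> List[str]:
    """Build a spark-submit argument list for a Kubernetes cluster-mode job."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        # The image decides which Spark version this particular job runs on.
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]


# Two jobs on the same cluster, each pinned to a different Spark version.
# All values here are hypothetical placeholders.
job_spark2 = spark_submit_args(
    "https://example-apiserver:6443",
    "registry.example.com/spark:2.4.8",
    "com.example.LegacyEtl",
    "local:///opt/jobs/legacy-etl.jar",
)
job_spark3 = spark_submit_args(
    "https://example-apiserver:6443",
    "registry.example.com/spark:3.1.1",
    "com.example.NewEtl",
    "local:///opt/jobs/new-etl.jar",
)

print(" ".join(job_spark3))
```

Because the version lives in the job's image rather than in the cluster installation, nothing in the cluster needs to change when a team moves a job from one Spark version to another.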

Challenges of Going Cloud Native

Computing and storage separation: How to build a Hadoop-compatible file system (HCFS) on top of OSS object storage

Must be fully compatible with existing HDFS usage

Performance comparable to HDFS, with lower costs

Separation of computing from shuffle data storage: How to handle ACK's mix of heterogeneous instance types

Heterogeneous instance types do not have local disks

The community discussed this in [SPARK-25299], and supporting Spark dynamic resource allocation has become an industry consensus

ACK Scheduling Capabilities: How to Solve Scheduling Performance Bottlenecks

Scheduling performance must be comparable to YARN

Multi-level queue management

Peak-shifting scheduling

Using Kubernetes as the cluster operating system to orchestrate the peaks and troughs of different workloads

Advantages of EMR on ACK

Remote Shuffle Service provides a storage and computing separation solution for intermediate shuffle data

Computing nodes no longer need local disks or cloud disks

Supports enabling Spark dynamic resource allocation, a complete solution to the problem raised in [SPARK-25299]
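Why moving shuffle data off the computing nodes makes dynamic resource allocation practical can be shown with a toy model. This is only an illustration of the reasoning, not the Remote Shuffle Service implementation: the `Executor` class and both functions are hypothetical. An idle executor that still holds shuffle output on its own disk cannot be released without losing that data, while an executor whose shuffle output lives in a remote service can be released as soon as it is idle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Executor:
    name: str
    running_tasks: int
    local_shuffle_blocks: int  # shuffle output still stored on this executor


def releasable(executors: List[Executor]) -> List[str]:
    """Executors that can be removed without losing work or shuffle data."""
    return [
        e.name
        for e in executors
        if e.running_tasks == 0 and e.local_shuffle_blocks == 0
    ]


def with_remote_shuffle(executors: List[Executor]) -> List[Executor]:
    """Model a remote shuffle service: no shuffle output stays on executors."""
    return [Executor(e.name, e.running_tasks, 0) for e in executors]


pool = [
    Executor("exec-1", running_tasks=0, local_shuffle_blocks=12),
    Executor("exec-2", running_tasks=3, local_shuffle_blocks=4),
    Executor("exec-3", running_tasks=0, local_shuffle_blocks=0),
]

print(releasable(pool))                       # ['exec-3']
print(releasable(with_remote_shuffle(pool)))  # ['exec-1', 'exec-3']
```

In the model, `exec-1` is idle but pinned by its local shuffle blocks. Once shuffle output is stored remotely, only running tasks keep an executor alive.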

JindoFS provides data lake acceleration for OSS storage

In block mode, the 1 TB TPC-DS scenario shows a performance improvement of more than 15%

The scheduling layer supports Scheduler Framework V2

Scheduling performance is more than 3x that of the community scheduler

Provides multi-level queue management
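Multi-level queue management can be pictured as a tree of queues in which a job is admitted only if its own queue and every ancestor queue still have headroom. The sketch below is a minimal model under that assumption. It is not the ACK scheduler's algorithm, and the `Queue` class, queue names, and core limits are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    name: str
    max_cores: int
    parent: Optional["Queue"] = None
    used_cores: int = 0


def can_admit(queue: Queue, cores: int) -> bool:
    """A request fits only if no queue on the path to the root would overflow."""
    node: Optional[Queue] = queue
    while node is not None:
        if node.used_cores + cores > node.max_cores:
            return False
        node = node.parent
    return True


def admit(queue: Queue, cores: int) -> bool:
    """Reserve cores on the queue and all of its ancestors, or change nothing."""
    if not can_admit(queue, cores):
        return False
    node: Optional[Queue] = queue
    while node is not None:
        node.used_cores += cores
        node = node.parent
    return True


root = Queue("root", max_cores=100)
etl = Queue("root.etl", max_cores=60, parent=root)
adhoc = Queue("root.adhoc", max_cores=60, parent=root)
daily = Queue("root.etl.daily", max_cores=40, parent=etl)

print(admit(daily, 40))  # True: fits in daily (40), etl (60), and root (100)
print(admit(daily, 1))   # False: daily is already full
print(admit(adhoc, 60))  # True: root is now at 100 of 100 cores
print(admit(etl, 10))    # False: etl has room, but root does not
```

The last call shows the point of the hierarchy: a child queue's limit is a ceiling, not a guarantee, because the parent's limit is checked as well.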

Engine Capability Enhancement

In the 10 TB TPC-DS benchmark, EMR Spark delivers a 3x performance improvement over community Spark

Hudi and Delta Lake are enhanced in functionality and performance compared with the community versions

Complete peak-shifting scheduling solution

2. EMR containerized architecture

EMR on ACK Architecture

Lightweight management and control that connects to existing data platforms

Jobs are submitted to different execution platforms through data development clusters or scheduling platforms

Peak-shift scheduling: placement is adjusted according to business peak and off-peak strategies

Cloud-native data lake architecture: ACK has strong elastic scale-out and scale-in capabilities

ACK manages heterogeneous clusters with good flexibility
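The peak-shift idea in this architecture can be sketched as a routing rule in the scheduling platform. The example below assumes one possible policy: while online business is at its peak, batch jobs go to an existing dedicated cluster, and during the off-peak window they go to the ACK cluster to use capacity that would otherwise sit idle. The peak window and the platform names are hypothetical, and a real policy would come from the business's own traffic pattern.

```python
from datetime import time

# Hypothetical peak window for online business.
ONLINE_PEAK_START = time(9, 0)
ONLINE_PEAK_END = time(22, 0)


def choose_platform(submit_time: time) -> str:
    """Pick an execution platform for a batch job based on time of day."""
    in_online_peak = ONLINE_PEAK_START <= submit_time < ONLINE_PEAK_END
    # During the online peak, leave ACK capacity to online services.
    return "dedicated-emr-cluster" if in_online_peak else "ack-cluster"


print(choose_platform(time(2, 30)))   # ack-cluster
print(choose_platform(time(12, 0)))   # dedicated-emr-cluster
print(choose_platform(time(22, 0)))   # ack-cluster
```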

3. Product Introduction

Product Home

Reference link: https://www.aliyun.com/product/emapreduce

Create a new cluster

Region: Currently available in Hangzhou, Shanghai, Beijing, Shenzhen, and other regions (more regions are being added)
Cluster type: Spark, Shuffle Service, Presto
Spark: A general-purpose distributed big data processing engine that provides ETL, offline batch processing, data modeling, and other capabilities

Shuffle Service: Provides an optimized shuffle service for the EMR computing engines, removing the dependency on local disks under Kubernetes

Relieves the network and disk I/O bottlenecks of large-scale computing clusters

Supports a computing and storage separation architecture and can serve multiple EMR clusters

Presto: A memory-based distributed SQL engine for interactive queries that supports multiple data sources

Suitable for complex analysis of PB-level data and for queries across data sources

Component version: Spark (3.1.1)
Dedicated nodes:
Use an existing ACK cluster and share some of its nodes with EMR

Create a new ACK cluster and use the entire cluster as dedicated nodes

OSS Bucket: used to store jobs, logs, jar packages, and other information

Cluster Management

Cluster ID/Name: Click to enter job management

Cluster status: Check whether the cluster is available.

ACK cluster: Can be associated with an existing ACK cluster.

Cluster configuration: Spark job configuration.

Release: Release the cluster.
