EMR on ACK is newly released to help enterprises efficiently build big data platforms

Alibaba Cloud EMR on ACK gives users a new way to build a big data platform: they can deploy open source big data services on Alibaba Cloud Container Service for Kubernetes (ACK). Because ACK handles service deployment and provides high-performance, scalable container application management, users only need to focus on the big data jobs themselves. Spark, Presto, and Flink jobs can be run directly on the ACK cluster. The engines are 100% compatible with open source and perform better than the open source versions.

1. Background

Technology Trends

Separation of storage and computing, evolution towards cloud native

Online business, AI, and big data are uniformly connected to the ACK cluster, with peak-shifting scheduling, offline and online co-location, and improved machine utilization

Unified operation and maintenance entry, unified operation and maintenance tool chain, and unified monitoring system

From cluster-centric to job-centric

Multi-version support, for example, Spark 2.x and Spark 3.x can be run at the same time

Challenges of Going Cloud Native

Computing and storage separation: how to build an HCFS (Hadoop Compatible File System) on top of OSS object storage

Must be fully compatible with existing HDFS

Performance comparable to HDFS, with lower costs
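
One way to picture the HDFS compatibility requirement: if OSS is exposed through a Hadoop-compatible file system, an existing job would ideally only need its paths switched from an hdfs:// location to an oss:// location. The sketch below shows only that path rewriting. The helper name `to_oss_uri`, the NameNode address, and the bucket name are hypothetical, and the sketch says nothing about how the file system itself is implemented.

```python
from urllib.parse import urlsplit


def to_oss_uri(hdfs_uri: str, bucket: str) -> str:
    """Map an hdfs:// URI onto the same path inside an OSS bucket."""
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # Drop the NameNode authority and keep the directory layout unchanged.
    return f"oss://{bucket}{parts.path}"


# Placeholder values for illustration only.
print(to_oss_uri("hdfs://namenode:8020/warehouse/sales/part-0.parquet", "example-bucket"))
```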

Storage and computing separation for the computing engine's shuffle data: how to handle mixed heterogeneous machine types in ACK

Heterogeneous machine types do not have local disks

The community discussion in [SPARK-25299] on supporting Spark dynamic resource allocation has become an industry consensus

ACK Scheduling Capabilities: How to Solve Scheduling Performance Bottlenecks

Performance benchmarked against YARN

Multi-level queue management

Peak-shifting scheduling

Leveraging the capabilities of the K8s operating system to orchestrate the peaks and troughs of various businesses
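
The idea behind peak-shifting scheduling can be sketched as a time-based split of one shared node pool: offline big data jobs get a small share while online business is at its peak and a large share during the trough. The peak window (09:00 to 22:00) and the percentages below are made-up values for illustration, and `offline_node_quota` is a hypothetical helper. A real scheduler works with pods, priorities, and live load rather than a fixed table.

```python
def offline_node_quota(
    hour: int,
    total_nodes: int,
    online_peak: range = range(9, 22),
    peak_percent: int = 20,
    off_peak_percent: int = 80,
) -> int:
    """Number of nodes offline jobs may use at a given hour of the day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    # During the online peak, offline jobs only get the smaller share.
    percent = peak_percent if hour in online_peak else off_peak_percent
    return total_nodes * percent // 100


for h in (2, 10, 23):
    print(h, offline_node_quota(h, total_nodes=100))
```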

Advantages of EMR on ACK

Remote Shuffle Service provides a storage and computing separation solution for intermediate shuffle data

Computing nodes no longer need local disks or cloud disks

Supports enabling Spark dynamic resource allocation, a complete solution to SPARK-25299
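
Why moving shuffle data off the executors matters for dynamic resources can be illustrated with a toy model: an idle executor that still holds shuffle files on its own disk cannot be released without losing that data, whereas with a remote shuffle service any idle executor can go. The `Executor` class and `releasable` function are hypothetical names for this sketch and do not reflect the internals of Spark or of the Remote Shuffle Service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Executor:
    name: str
    idle: bool
    holds_shuffle_files: bool  # shuffle output written to the executor's own disk


def releasable(executors: List[Executor], remote_shuffle: bool) -> List[str]:
    """Names of executors that can be released without losing shuffle data."""
    return [
        e.name
        for e in executors
        # With remote shuffle, the data lives outside the executor,
        # so being idle is the only condition that matters.
        if e.idle and (remote_shuffle or not e.holds_shuffle_files)
    ]


pool = [
    Executor("exec-1", idle=True, holds_shuffle_files=True),
    Executor("exec-2", idle=True, holds_shuffle_files=False),
    Executor("exec-3", idle=False, holds_shuffle_files=True),
]
print(releasable(pool, remote_shuffle=False))
print(releasable(pool, remote_shuffle=True))
```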

JindoFS provides a data lake acceleration solution for OSS storage

In block mode, the 1 TB TPC-DS scenario shows a performance improvement of more than 15%

The scheduling layer supports Scheduler Framework V2

Scheduling performance is more than 3x higher than that of the community version

Provides multi-level queue management
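
Multi-level queue management is commonly modeled as a tree in which every queue receives a percentage of its parent's capacity. The sketch below assumes that YARN-style capacity model. The queue names and percentages are invented, `effective_share` is a hypothetical helper, and none of this is the Scheduler Framework API.

```python
from fractions import Fraction
from typing import Dict

# Hypothetical queue tree: each value is a percentage of the parent queue.
QUEUES: Dict[str, int] = {
    "root": 100,
    "root.online": 40,
    "root.offline": 60,
    "root.offline.etl": 70,
    "root.offline.adhoc": 30,
}


def effective_share(queue: str, queues: Dict[str, int] = QUEUES) -> Fraction:
    """Fraction of the whole cluster that a queue is entitled to."""
    share = Fraction(1)
    parts = queue.split(".")
    # Multiply the percentages along the path from the root to the queue.
    for depth in range(1, len(parts) + 1):
        path = ".".join(parts[:depth])
        share *= Fraction(queues[path], 100)
    return share


print(effective_share("root.offline.etl"))  # 60% of 70% of the cluster
```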

Engine Capability Enhancement

In the 10 TB TPC-DS benchmark scenario, EMR Spark performs 3x better than the community version

Hudi and Delta Lake offer enhanced features and performance compared to the community versions

Complete peak-shifting scheduling solution

2. EMR Containerized Architecture

EMR on ACK Architecture

Lightweight management and control that connects to existing data platforms

Jobs are submitted to different execution platforms through data development clusters or scheduling platforms

Peak-shifting scheduling, adjusted according to business peak and off-peak strategies

Cloud-native data lake architecture, with ACK providing strong elastic scale-out and scale-in capabilities

ACK manages heterogeneous clusters with good flexibility

3. Product Introduction

Product Home

Reference link: https://www.aliyun.com/product/emapreduce

Create a new cluster

Region: currently available in Hangzhou, Shanghai, Beijing, Shenzhen, and other regions, with more being opened over time
Cluster type: Spark, Shuffle Service, Presto
Spark: a general-purpose distributed big data processing engine that provides ETL, offline batch processing, data modeling, and other capabilities

Shuffle Service: provides an optimized shuffle service for EMR computing engines, removing their dependency on local disks under Kubernetes

Solves the network and disk I/O bottlenecks of large-scale computing clusters

Supports a computing and storage separation architecture and can serve multiple EMR clusters

Presto: an in-memory distributed SQL engine for interactive queries that supports multiple data sources

Suitable for complex analysis of PB-level massive data and cross-data source queries

Component version: Spark (3.1.1)
Dedicated nodes:
Use an existing ACK cluster and dedicate some of its nodes to EMR

Create a new ACK cluster and select the entire cluster as dedicated nodes

OSS Bucket: used to store jobs, logs, jar packages, and other information
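
Since the Spark service is described as compatible with open source, a job could in principle be pointed at the cluster the way open source Spark targets any Kubernetes cluster. The sketch below only assembles a `spark-submit` argument list from standard open source options (`--master k8s://...`, `spark.kubernetes.namespace`, `spark.kubernetes.container.image`, `spark.kubernetes.node.selector.<label>`). Whether EMR on ACK expects this submission path is an assumption. The API server address, image, node label, and bucket name are placeholders, and reading a jar from an oss:// path assumes the image ships an OSS-capable file system connector.

```python
from typing import Dict, List, Optional


def build_spark_submit(
    api_server: str,
    image: str,
    main_class: str,
    app_jar: str,
    namespace: str = "default",
    node_labels: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list that targets a Kubernetes cluster."""
    conf = {
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
    }
    # Restrict driver and executor pods to nodes carrying these labels,
    # for example the nodes dedicated to big data jobs.
    for key, value in (node_labels or {}).items():
        conf[f"spark.kubernetes.node.selector.{key}"] = value

    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app_jar)
    return args


# All values below are placeholders, not real endpoints or images.
cmd = build_spark_submit(
    api_server="https://203.0.113.10:6443",
    image="registry.example.com/spark:3.1.1",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="oss://example-bucket/jars/spark-examples.jar",
    node_labels={"workload": "emr"},
)
print(" ".join(cmd))
```

Returning a list rather than a single string lets the result be passed to `subprocess.run` without shell quoting issues.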

Cluster Management

Cluster ID/Name: Click to enter job management

Cluster status: Check whether the cluster is available.

ACK cluster: Can be associated with an existing ACK.

Cluster configuration: Spark job configuration.

Release: Release space.
