Every technology eventually travels from the esoteric to the everyday, just as our image of computers has shifted from "a machine room you may only enter wearing shoe covers" to the smartphone in everyone's pocket. Over the past 20 years, big data technology has gone through the same process: once "rocket science," it is now accessible to everyone.

Looking back, the early years of big data produced a flood of open source and in-house systems, and long "red ocean" battles were fought within each niche: YARN vs. Mesos, Hive vs. Spark, Flink vs. Spark Streaming vs. Apex, Impala vs. Presto vs. ClickHouse, and so on. After fierce competition and elimination, the winning products scaled up and captured the market and the developers. Indeed, in recent years no new star open source engine has emerged in the big data field (ClickHouse was open-sourced in 2016, PyTorch in 2018). Marked by events such as Apache Mesos ceasing maintenance, the big data field has entered a "post-red-ocean" era: the technology has begun to converge and has moved into a stage of popularization and large-scale business application.

The author of this article, Guan Tao, is a senior expert in big data systems who spent the last 15 of big data's 20 years of development at Microsoft (the Internet and Azure cloud groups) and Alibaba (Alibaba Cloud). From the perspective of system architecture, this article surveys the hot topics in big data architecture, the development of each technology line, and the trends and unresolved issues. It is worth noting that the big data field is still developing: some technologies are converging, but new directions and new sub-fields keep emerging.
The content reflects my personal experience and perspective; omissions and biases are inevitable, and given the limited length it cannot be comprehensive. Treat it as a starting point for discussion with peers.

1. Current hot topics in big data systems

The concept of "big data" was proposed in the 1990s. Since Google's three classic papers (GFS, BigTable, and MapReduce) laid the foundation, the field has developed for nearly 20 years, producing excellent systems such as Google's internal big data stack, Microsoft's Cosmos, Alibaba Cloud's Feitian (Apsara) system, and the open source Hadoop ecosystem. These systems gradually pushed the industry into the era of "digitalization" and, later, of AI. The massive data volumes and the value they contain attracted heavy investment and greatly advanced big data technology, while the rise of the cloud put that technology within easy reach of small and medium-sized enterprises. It can be said that big data technology has come of age at just the right time. From the system-architecture point of view, four topics are the current hot spots of platform technology: the evolution toward the Shared-Everything architecture, the convergence of lake and warehouse technology, the fundamental design upgrades brought by cloud native, and better support for AI.

1.1 From the system-architecture perspective, platforms are evolving toward Shared-Everything

System architecture in the broader data field has evolved from the scale-up of traditional databases to the scale-out of big data. From a distributed-systems perspective, the overall architecture falls into three types: Shared-Nothing (also called MPP), Shared-Data, and Shared-Everything.
The data warehouse systems of big data platforms grew out of databases, and the Shared-Nothing (MPP) architecture was the mainstream for a long time. As cloud-native capabilities matured, the Shared-Data architecture, represented by Snowflake, gradually developed. Big data systems built on the principles of DFS and MapReduce were designed as Shared-Everything architectures from the beginning; representatives are Google BigQuery and Alibaba Cloud MaxCompute. Architecturally, Shared-Everything offers the greatest flexibility and potential and is the likely direction of future development.

(Figure: Three types of big data system architecture)

1.2 From the data-management perspective, data lakes and data warehouses are merging into the integrated "lakehouse"

The high performance and management capabilities of data warehouses and the flexibility of data lakes are borrowing from and merging with each other. In 2020, vendors began proposing integrated lake-warehouse architectures, which became the hottest trend in architecture evolution. The lakehouse takes many forms, however, and the different forms are still evolving and being debated.

(Figure: Reference integration of data lake and data warehouse)

1.3 From the cloud-architecture perspective, cloud native and managed services have become mainstream

As big data platform technology enters deep water, users are diverging. More and more small and medium-sized users no longer develop or operate their own data platforms but embrace fully managed (usually cloud-native) data products; Snowflake, the typical product in this space, has been widely recognized. Looking ahead, only a small number of very large leading companies will keep the self-built (open source plus in-house improvements) model.
(Figure: Snowflake's cloud-native architecture)

1.4 From the computing-mode perspective, AI is gradually becoming mainstream, forming a BI+AI dual mode

BI, as statistical analysis, mainly summarizes the past; AI is increasingly good at predicting the future. In the past five years, algorithmic workloads have grown from less than 5% of data-center capacity to 30%. AI has become a first-class citizen of the big data field.

2. Domain architecture of big data systems

Of the three architectures introduced above (#1.1), the two systems I have worked on (Microsoft's Cosmos/Scope and Alibaba Cloud's MaxCompute) are both Shared-Everything architectures. From the Shared-Everything point of view, I therefore divide the big data field into six partially overlapping sub-fields plus three horizontal fields, nine in total, as shown in the figure below.

(Figure: Domain architecture based on a Shared-Everything big data system)

After years of development, each field has made real progress and accumulation. The following sections outline, for each sub-field, its evolution, the forces driving it, and its likely direction.

2.1 Distributed storage is evolving toward multi-tier intelligence

Distributed storage in this article refers specifically to general-purpose, massive-scale big data storage, a typical stateful distributed system. High throughput, low cost, disaster tolerance, and high availability are its core optimization goals. (Note: the generational division below is only for ease of explanation and does not represent a strict architectural lineage.)

The first generation of distributed storage is represented by Google's GFS and Apache Hadoop's HDFS, both append-only file systems with multi-replica support.
Because HDFS's early NameNode fell short on scalability and disaster recovery and could not fully meet users' availability requirements, many large companies built their own storage, such as Microsoft's Cosmos storage (which later evolved into Azure Blob Storage) and Alibaba's Pangu. As the foundation of open source storage, the HDFS interface became a de facto standard, and HDFS also gained plug-in support for using other systems as its underlying storage.

The second generation: on this chassis, the surge in demand for massive object storage (e.g., huge photo collections) led to a metadata service layer for massive small objects being built on top of the general append-only file system, forming object-based storage. Typical representatives are AWS S3 and Alibaba Cloud OSS. Notably, both S3 and OSS can serve, via standard plug-ins, as the actual storage backend of HDFS.

The third generation is represented by data lakes. With the progress of cloud computing and networking (after 2015), the coupled storage-compute architecture was gradually replaced by a new architecture of cloud-native hosted storage plus storage-compute separation; this is also the starting point of data lake systems. Meanwhile, the bandwidth penalty of separating storage and compute has not been completely solved, and caching services such as Alluxio were born in that niche.

The fourth generation is the current trend. With storage hosted in the cloud, the implementation is transparent to users, so storage systems can pursue more sophisticated designs and have begun evolving into multi-tier integrated storage systems.
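The cost-versus-performance trade-off that drives such tiering can be illustrated with a toy placement policy. The tier names, prices, and thresholds below are purely hypothetical, a minimal sketch of how a system might map access frequency onto cache, replicated, erasure-coded, and archival tiers:

```python
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    size_gb: float
    reads_last_30d: int

# Hypothetical monthly cost per GB for each tier (illustrative numbers only).
TIER_COST = {
    "mem_ssd_cache": 0.200,
    "hdd_3x_replica": 0.100,
    "hdd_erasure_coded": 0.046,  # ~1.375x overhead vs. 3x replication
    "archive": 0.004,            # Glacier-like cold storage
}

def choose_tier(p: Partition) -> str:
    """Toy policy: the hotter the partition, the faster (and pricier) the tier."""
    if p.reads_last_30d >= 1000:
        return "mem_ssd_cache"
    if p.reads_last_30d >= 10:
        return "hdd_3x_replica"
    if p.reads_last_30d >= 1:
        return "hdd_erasure_coded"
    return "archive"

def monthly_cost(p: Partition) -> float:
    return p.size_gb * TIER_COST[choose_tier(p)]

hot = Partition("dt=2021-06-01", 512, 5000)
cold = Partition("dt=2018-01-01", 512, 0)
print(choose_tier(hot), choose_tier(cold))  # mem_ssd_cache archive
```

Real systems replace the hard-coded thresholds with learned or adaptive policies; this is exactly where the article expects AI to play a growing role.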
From single-SATA-disk systems, storage has evolved into multi-tier systems combining memory/SSD, SATA with 3x replication, SATA with erasure coding (typically 1.375x overhead), and cold "ice" storage typified by AWS Glacier. How to tier data intelligently and transparently, finding the right trade-off between cost and performance, is the key challenge of multi-tier storage. The field has only just started; there is no outstanding open source product yet, and the state of the art is the self-developed warehouse storage of a few large vendors.

(Figure: Alibaba MaxCompute's multi-tier integrated storage system)

Above the storage system sits the file format layer (File Format layer), which is orthogonal to the storage system itself.

The first generation of storage formats covers file layouts, compression and encoding techniques, and index support. The two mainstream formats today are Apache Parquet and Apache ORC, from the Spark and Hive ecosystems respectively. Both are columnar formats designed for big data; ORC has advantages in compression and encoding, while Parquet is better at semi-structured data. There is also Apache Arrow, a format as well, but one whose design targets in-memory data exchange rather than on-disk storage.

The second generation of storage formats consists of the near-real-time formats represented by Apache Hudi and Delta Lake. Early storage formats were large-file columnar layouts optimized for throughput, not latency. With the trend toward real-time processing, both mainstream formats evolved to support it: Databricks launched Delta Lake to give Apache Spark near-real-time ACID operations on data, and Uber launched Apache Hudi to support near-real-time upserts.
Although the two differ in detail (e.g., merge-on-read versus merge-on-write), the common approach is to support incremental files, avoiding the indiscriminate full-merge that updating traditional Parquet/ORC files requires, and thereby shortening the data update cycle to near real time. Because the near-real-time direction involves more frequent file merges and finer-grained metadata, the interfaces are more complex: Delta and Hudi are not simple formats but sets of services.

Storage formats will continue to evolve toward real-time update support, combined with real-time indexes. They will no longer be mere file formats but will integrate with in-memory structures to form complete solutions. The mainstream implementations of real-time updates are based on the log-structured merge tree (LSM-tree, used by almost all real-time warehouses) or on Lucene indexes (the Elasticsearch format).

Seen from the storage system's interfaces and internal features, simpler interfaces and features correspond to more openness (e.g., GFS/HDFS), while more complex and efficient features usually mean more closedness, degenerating toward an integrated storage-compute system (e.g., AWS's leading warehouse product Redshift). Technologies from the two directions are converging.

Looking ahead, we see the following possible directions and trends:

1) At the platform level, storage-compute separation will become standard within two to three years, and platforms will move toward hosting and cloud native. Within the platform, refined tiering will be the key means of balancing performance and cost (current data lake products are far from sufficient here), and AI will play a bigger role in tiering algorithms.
2) At the format level, evolution will continue, but major breakthroughs and upgrades will most likely depend on new hardware (encoding and compression have limited optimization headroom on general-purpose processors).

3) As data lakes and warehouses integrate further, storage becomes more than a file system. How thick the storage layer should be, and where its boundary with compute lies, remain key open questions.

2.2 Distributed scheduling: on a cloud-native basis, toward a unified framework and diversified algorithms

Compute resource management is the core capability of distributed computing; its essence is to optimally match diverse workloads with resources. In the "post-red-ocean era," Google's Borg and the open source Apache YARN are still the key products in this field, while Kubernetes is still catching up in big data scheduling.

Common cluster scheduling architectures include:

Centralized scheduling: early Hadoop 1.0 MapReduce, and later Borg and Kubernetes, are centralized frameworks in which a single scheduler assigns tasks to machines in the cluster. In particular, most systems adopt a two-level scheduling framework inside the central scheduler: by separating resource scheduling from job scheduling, job scheduling logic can be customized per application while cluster resources remain shared across jobs. YARN and Mesos both take this architecture.
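The two-level idea, separating resource scheduling from job scheduling, can be sketched in a few lines. This is a toy model in the spirit of Mesos-style resource offers, not any real scheduler's API; all class names, node names, and the best-fit policy are invented for illustration:

```python
# Toy two-level scheduler: a central resource manager offers free resources
# (level 1); each framework's own scheduler decides task placement (level 2).

class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)  # node -> free CPU cores

    def offer(self):
        """Level 1: advertise the currently free resources."""
        return dict(self.free)

    def allocate(self, node, cores):
        if self.free.get(node, 0) >= cores:
            self.free[node] -= cores
            return True
        return False

class FrameworkScheduler:
    """Level 2: application-specific placement logic (here: best fit)."""
    def place(self, task_cores, offers):
        candidates = [n for n, c in offers.items() if c >= task_cores]
        return min(candidates, key=lambda n: offers[n]) if candidates else None

rm = ResourceManager({"node-a": 8, "node-b": 4})
fw = FrameworkScheduler()

for task in [2, 4, 3]:  # tasks needing this many cores
    node = fw.place(task, rm.offer())
    assert node is not None and rm.allocate(node, task)
    print(f"task({task} cores) -> {node}")
```

The separation is the point: the resource manager knows nothing about jobs, so different frameworks can plug in different `place` policies while sharing one pool of cluster resources.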
(Figure: The evolution of cluster scheduler architectures, by Malte Schwarzkopf)

Whatever its scheduling architecture, a big data system processing massive data needs scheduling capabilities along dimensions such as:

Data scheduling: services spanning multiple data centers and regions raise global data-placement problems, requiring optimal use of storage space and network bandwidth.

Looking ahead, we see the following possible directions and trends:

1. K8S as the unified scheduling framework: Google Borg long ago proved that unified resource management favors optimal matching and peak shaving. Although K8S still faces challenges in scheduling "non-online services," its precise positioning and flexible plug-in design should make it the eventual winner. Big data schedulers on K8S (such as kube-batch) are currently a hot area of investment.

2. Diversified and intelligent scheduling algorithms: with resources decoupled (e.g., storage from compute), scheduling algorithms can optimize more deeply along individual dimensions, and AI-driven optimization is a key direction (indeed, years ago Google Borg already used Monte Carlo simulation to predict the resource requirements of new tasks).

3. Scheduling support for heterogeneous hardware: many-core ARM has become a hot topic in general-purpose computing, and AI accelerators such as GPUs and TPUs are mainstream. Schedulers need to support diverse heterogeneous hardware behind simple abstractions; here, K8S's plug-in design has a clear advantage.

2.3 Unification of metadata services

Metadata services underpin the big data platform and the compute engines and frameworks above it. They are online services with high call frequency and high throughput.
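The Monte Carlo resource prediction mentioned in trend 2 can be roughly sketched as follows: resample the historical resource usage of similar tasks many times and reserve a high quantile rather than the mean. The distribution, sample counts, and quantile below are invented for illustration, not Borg's actual method:

```python
import random
import statistics

random.seed(42)

# Hypothetical historical peak-memory samples (GB) for tasks of one kind;
# a log-normal stands in for the typically right-skewed usage distribution.
history = [random.lognormvariate(1.0, 0.4) for _ in range(500)]

def predict_reservation(samples, n_trials=10_000, quantile=0.95):
    """Monte Carlo sketch: resample history and take a high quantile
    as the amount of memory to reserve for the next such task."""
    draws = sorted(random.choice(samples) for _ in range(n_trials))
    return draws[int(quantile * n_trials)]

reservation = predict_reservation(history)
mean_usage = statistics.mean(history)
print(f"mean usage ~{mean_usage:.2f} GB, reserve ~{reservation:.2f} GB")
```

Reserving a quantile instead of the mean trades some utilization for a bounded risk of the task being killed for exceeding its reservation, which is the kind of cost/risk balance scheduling algorithms optimize.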
They must provide high availability and stability, and support continuous compatibility, hot upgrades, and multi-cluster (replica) management. Their main functions include:

- DDL/DML business logic, guaranteeing ACID properties and data integrity and consistency;
- authorization and authentication, guaranteeing data access security.

The metadata system of first-generation data platforms is Hive's MetaStore (HMS). In early versions, HMS was a built-in Hive service: the consistency between metadata updates (DDL) and the data read and written by DML jobs was strongly coupled to the Hive engine, and the metadata itself was usually hosted in a relational database such as MySQL. As customers demanded more in processing consistency (ACID), openness (multiple engines and data sources), real-time behavior, and large-scale scalability, the traditional HMS became confined to single-cluster, single-tenant, Hive-centric use within one enterprise, while the operation and maintenance cost of keeping its data safe and reliable stayed high. These shortcomings are increasingly exposed in large-scale production environments.

The second generation of metadata systems is represented by the open source Apache Iceberg and by the cloud-native metadata system of Alibaba's big data platform MaxCompute. Iceberg, which emerged in the past two years, is an engine- and storage-independent "metadata system" whose core goals are ACID for big data processing and removing the performance bottleneck of table and partition metadata at scale. In its implementation, Iceberg's ACID relies on the POSIX semantics of the file system, and partition metadata is stored in files.
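Iceberg-style snapshot ACID can be approximated in miniature: every commit produces an immutable snapshot, which becomes visible only through an atomic swap of the current-snapshot pointer, and a writer whose base snapshot has moved must retry. This is a highly simplified sketch in which a lock stands in for the atomic rename or catalog compare-and-swap a real system uses; names and structures are invented:

```python
import threading

# Toy snapshot-based table: state is an immutable list of data files per
# snapshot; commits use optimistic concurrency on the current-snapshot id.

class Table:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic rename/CAS
        self.snapshots = {0: []}       # snapshot id -> list of data files
        self.current = 0

    def commit(self, base_id, new_files):
        """Attempt to commit new files on top of snapshot `base_id`."""
        with self._lock:
            if self.current != base_id:   # someone else committed first
                return None               # caller must retry on the new base
            new_id = base_id + 1
            self.snapshots[new_id] = self.snapshots[base_id] + new_files
            self.current = new_id
            return new_id

t = Table()
assert t.commit(0, ["part-0001.parquet"]) == 1
# A writer still working against snapshot 0 now fails and must retry:
assert t.commit(0, ["part-0002.parquet"]) is None
assert t.commit(1, ["part-0002.parquet"]) == 2
print(t.snapshots[t.current])  # ['part-0001.parquet', 'part-0002.parquet']
```

Because readers always see a complete snapshot and commits are all-or-nothing pointer swaps, no engine-side locking is needed, which is what lets the metadata layer sit independently beneath multiple engines.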
At the same time, Iceberg's table format is independent of the Hive MetaStore metadata interface, so the cost of adoption is high: every engine must be modified to use it. Against this background of hot spots and trends, open, managed, unified metadata services are becoming increasingly important, and many cloud vendors have begun offering Data Catalog services that let multiple engines access the lake and warehouse storage layers.

Comparing the first and second generations of metadata systems, and looking ahead, we see the following possible directions and trends:

1. Trend one: as lake-warehouse integration deepens, metadata will be unified, with built-in access to both metadata and data on the lake. For example, a unified metadata interface over a single account system will support access to lake and warehouse metadata alike. As write scenarios on the lake diversify, integrating the ACID capabilities of multiple table formats, and supporting the Delta, Hudi, and Iceberg formats at once, will be a challenge for platform products.

2. Trend two: the metadata permission system will shift to the enterprise tenant's identity and permission system and will no longer be tied to a single engine.

3. Trend three: metadata models will break out of the structured relational paradigm, offering richer models that support tagging, classification, custom types and metadata formats, AI compute engines, and so on.

This article has discussed the hot spots in the evolution of big data systems in the post-red-ocean era, along with an architectural reading of several sub-fields of the big data system.