Technical Tips | Alibaba Cloud's Practical Exploration of Building Lakehouse Based on Hudi

1. Data Lake and Lakehouse

At the 2021 Developer Conference, one of our researchers gave a talk that cited a lot of figures. His main point was that at this stage of the industry's development, data volumes have expanded dramatically and are growing at a staggering rate, whether measured by data scale, by the shift from real-time to intelligent production and processing, or by the accelerating migration of data to the cloud.

These figures come from Gartner and IDC analyses, summarized from the most authoritative reports in the industry. They tell us that the data field, and the analytics field in particular, holds both great opportunities and great challenges.

With data at this scale, truly mining and using its value raises many challenges. First, existing architectures are gradually migrating to cloud architectures. Second, data volumes keep growing. Third, Serverless pay-as-you-go is shifting from a trial option to the default choice. Fourth, applications and data sources are increasingly diverse. Anyone who has worked with the cloud knows that every cloud vendor offers many kinds of services, especially data services, and a large number of data sources inevitably makes analysis difficult: when you want to do correlated analysis, connecting heterogeneous data sources is a big problem. Then there are differentiated data formats. When writing data we usually choose convenient, simple formats such as CSV or JSON, but for analysis these formats are very inefficient; once data reaches the TB or PB level they become impossible to analyze, which is why analysis-oriented columnar formats such as Parquet and ORC were developed. Add to that link security, differentiated user groups, and so on, and the expansion of data volume has added many difficulties to analysis.

In real customer scenarios, a lot of data has already been moved to the cloud and "entered the lake". What is a lake? Our definition and understanding of a lake is essentially object storage such as AWS S3 or Alibaba Cloud OSS: a simple, easy-to-use API that can store data in many different formats, with benefits such as unlimited capacity and pay-as-you-go pricing. Doing analysis directly on the lake used to be very troublesome. It often required T+1 warehouse building and handoffs between various cloud services; sometimes the data format was wrong and ETL had to be done manually; if the data was already in the lake, metadata discovery and analysis still had to be done; and so on. The whole operations link was complicated and full of problems. These are the offline data lake problems our customers actually face; some have higher priority than others, but in short there are many of them.

In fact, Databricks began shifting its research focus from Spark to Lakehouse in 2019, publishing two papers that give a theoretical definition of how data lakes can be accessed in a unified and better way.

Based on the new Lakehouse concept, we want to shield the differences between formats, provide a unified interface for different applications, and simplify data access and analysis. Architecturally, this is the step-by-step evolution from data warehouse, to data lake, to Lakehouse.

The two papers describe several new ideas: first, how to design and implement MVCC so that offline data warehouses gain MVCC capabilities like a database, satisfying most batch-transaction needs; second, providing different storage modes to fit different read and write workloads; third, providing near-real-time write and merge capabilities to support larger data volumes on the link. In short, these ideas better solve the problems of offline data analysis.

There are currently three relatively popular products in the industry. The first is Delta Lake, a data lake management protocol released by Databricks itself; the second is Iceberg, an Apache open source project; the third is Hudi, first developed internally by Uber and later open sourced (in the early days Hive's ACID was more commonly used). All three products work against the HDFS API and can adapt to the underlying lake storage, and OSS can be adapted to the HDFS storage interface. Because the core principles are similar, the capabilities of the three products are gradually converging, and with the theoretical support of the papers, we had a direction for our own practice.

We chose Hudi for its product maturity and for its database-to-lake ingestion capability, whose form meets our database team's business needs around CDC.

Hudi was originally defined as the abbreviation of Hadoop Updates and Incrementals, later extended to the concepts of Update, Delete, and Insert on Hadoop. Its core logic is transaction versioning, state-machine control, and asynchronous execution. It simulates the full MVCC logic and incrementally manages internal columnar files such as Parquet and ORC via object lists to achieve efficient storage reads and writes. It is very close to the Lakehouse concept defined by Databricks; the same is true of Iceberg, whose capabilities are also gradually improving in this direction.
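
To make the MVCC idea concrete, here is a minimal, toy timeline in Python: each commit atomically publishes a new snapshot of the table's file listing, and readers only ever see the latest completed commit, never an in-flight one. This is an illustrative sketch of the versioning concept, not Hudi's actual implementation; all class and method names are invented.

```python
class MiniTimeline:
    """Toy, Hudi-inspired MVCC timeline over a table's file listing."""

    def __init__(self):
        self.completed = []   # list of (commit_time, file_listing)
        self.inflight = {}    # commit_time -> staged file versions

    def start_commit(self, commit_time):
        self.inflight[commit_time] = {}

    def write_file(self, commit_time, file_id, rows):
        # Writers stage new file versions under an in-flight commit.
        self.inflight[commit_time][file_id] = rows

    def complete_commit(self, commit_time):
        # Atomic publish: new snapshot = previous snapshot + staged files.
        base = dict(self.completed[-1][1]) if self.completed else {}
        base.update(self.inflight.pop(commit_time))
        self.completed.append((commit_time, base))

    def snapshot(self):
        # Readers see only the latest completed commit.
        return self.completed[-1][1] if self.completed else {}

t = MiniTimeline()
t.start_commit("001")
t.write_file("001", "f1", [{"id": 1}])
t.complete_commit("001")
t.start_commit("002")
t.write_file("002", "f1", [{"id": 1}, {"id": 2}])   # new version of f1
assert t.snapshot() == {"f1": [{"id": 1}]}           # in-flight commit invisible
t.complete_commit("002")
assert t.snapshot()["f1"] == [{"id": 1}, {"id": 2}]
```

The key property this models is that a reader holding a snapshot is never affected by concurrent writers, which is what lets batch transactions work on an offline store.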

The architecture shown on Hudi's official website takes this form. When we did technology selection and research, we found that many peers had already fully adopted Hudi for data ingestion and offline data management. First, the product is relatively mature; second, it meets our CDC needs; third, Delta Lake has an open source version and an internally optimized version, and only the open source version is offered externally, so we believed it does not necessarily expose the best parts. Iceberg started relatively late, and its capabilities were not as complete as the other two in the early days, so we did not consider it. Since ours is a Java team with its own Spark products, Hudi happened to fit best with using our own runtime to support data ingestion into the lake, so we chose Hudi.

Of course, we have kept watching the development of all three products. Later, StarLake, an open source project in China, did something similar. Each product keeps improving, and in the long run their capabilities are basically aligning; I think they will gradually match the capabilities defined in the papers.

"Based on the columnar, multi-version format of open source Hudi, ingest heterogeneous data sources into the lake incrementally and with low latency, storing them on open, low-cost object storage. Along the way, optimize data layout and evolve metadata, and finally achieve unified offline data management that equally supports the computing and analysis capabilities above. This is the overall solution." This is our understanding of Lakehouse and the direction of our technical exploration.

2. Alibaba Cloud Lakehouse Practice

The following introduces Alibaba Cloud Lakehouse's technical exploration and concrete practice. First, a brief introduction to the "database, warehouse, and lake integration" strategy that the Alibaba Cloud database team has been promoting in recent years.

As we all know, database products fall into four levels: DB; NewSQL/NoSQL products; data warehouse products; and data lake products. The higher up the stack, the higher the data value density, with data analyzed in the form of well-defined tables and warehouses; DB data formats, for example, are very simple and clear. The lower down, the larger the data volume and the more complex and varied the storage formats. Data lake formats include structured, semi-structured, and unstructured data; to analyze them, some refining and mining is required before the data's value can truly be used.

Each of the four storage directions has its own field, and each also has demands for correlated analysis. The main goal is to break down data silos and integrate the data so that its value becomes more multidimensional. If you only do log analysis, such as by region or customer source, you only need relatively simple capabilities such as GroupBy or Count. Lower-level data may need several rounds of cleaning and backflow before it can be analyzed layer by layer in online, high-concurrency scenarios. Here, data can flow not only from the lake directly to the database, but also to the warehouse, to NoSQL/NewSQL products, and to KV systems, making good use of online query capabilities, and so on.

The reverse is also true: data in these database/NewSQL products and even data warehouses can flow downward to build low-cost, large-capacity storage for backup and archiving, reducing the storage and analysis throughput pressure above and forming a powerful joint analysis capability. This is also my own understanding of the integration of databases, warehouses, and lakes.

Having covered the development direction and positioning of the database side, let's look at the positioning of Lakehouse within OLAP's own layered data warehouse system. Those who have built data warehouse products know this better than I do. (PPT diagram) It is basically a layered system like this: at the start, data in various forms sits outside the data warehouse or data lake system. Through Lakehouse capabilities, we ingest it into the lake and build the warehouse; through cleaning, consolidation, and aggregation we form an ODS or CDM layer, where preliminary aggregation and summarization create the concept of a data mart.

On Alibaba Cloud we store this data on OSS based on the Hudi protocol and the Parquet file format. Through internal ETL we further aggregate the initial data sets into clearer, more business-oriented data sets, and then build ETL to import them into the real-time data warehouse and elsewhere. These data sets can also be used directly for low-frequency interactive analysis, BI analysis, or machine learning engines such as Spark, finally feeding the data applications. This is the overall layered system.

Throughout this process we connect to a unified metadata system. If each part of the system had its own terminology and kept its own metadata, the OLAP system would be fragmented, so metadata must be unified. The same applies to scheduling: tables at different warehouse levels must be chained together across different places, which requires a complete, unified scheduling capability. The above is my understanding of Lakehouse's positioning in the OLAP system: mainly, attaching to the source layer and aggregating offline data.

The previous part introduced Lakehouse's positioning within the database and OLAP systems. The next part focuses on the design of Lakehouse in our own domain. Because I previously used K8s to move an analysis system to the cloud, I am familiar with many K8s concepts.

When designing our own system, we also tried to reference and learn from the K8s system. K8s embodies the DevOps concept we often mention, which is a practical paradigm. In this paradigm, many instances are created, and many applications are managed within those instances; the applications are ultimately scheduled and executed atomically through Pods, with business logic running in various containers inside the Pods.

We believe Lakehouse is also a paradigm: a paradigm for processing offline data. Here, the data set is our core concept; for example, we may need to build a set of data sets for a certain scenario or direction. We can define different data sets A, B, and C, each of which we regard as an instance. Around a data set we can orchestrate various workloads, such as DB ingestion into the lake. There are also analysis-optimization workloads, such as index construction and technologies like Z-ordering, Clustering, and Compaction, which improve query performance. There are also management workloads, such as regularly cleaning up historical data and tiering hot and cold storage, since OSS provides many such capabilities that we should make good use of. At the bottom are the jobs themselves: we build offline computing on Spark internally, orchestrate the workloads into small jobs, and all atomic jobs execute elastically on Spark. The above is our domain design for Lakehouse in technical practice.

This is the overall technical architecture. First, there are various data sources on the cloud. Through orchestration, the various defined workloads run on our own elastic Spark computing. The core storage is based on Hudi + OSS; we also support other HDFS-compatible systems such as Alibaba Cloud's LindormDFS. An internal metadata system manages metadata for databases, tables, columns, and so on. All management and control services are scheduled on K8s, and the upper layer connects to computing and analysis capabilities through the native Hudi interface. This is the whole elastic architecture.

Serverless Spark is our computing foundation, providing job-level elasticity; Spark itself also supports Spark Streaming, which implements stream computing by launching a short-lived Spark job. We chose OSS and LindormDFS as the storage foundation mainly for their low cost and unlimited capacity.

Within this architecture, how do we connect to the user's data so that it can be ingested into the lake for storage and analysis? The above is our VPC-based security solution. First, we use a shared-cluster model: the user side connects through the SDK and a VPDN network, and an internal Alibaba Cloud gateway connects to the computing cluster for management and scheduling. Then, through Alibaba Cloud's elastic network card (ENI) technology, we connect into the user's VPC for data access while providing routing and network isolation. Different users may have conflicting subnets; with elastic network card technology, even identical network segments can connect to the same computing cluster at the same time.

Those who have used Alibaba Cloud OSS know that OSS sits in a shared area of the Alibaba Cloud VPC network and does not require a complex network setup, while RDS and Kafka are deployed in the user's VPC, and multiple networks can be connected through one network architecture. Unlike VPC network segments, shared areas have no segment-conflict problem. Second, data is isolated, and the ENI carries end-to-end restrictions: the VPC has an ID mark and different authorization requirements, so if an illegitimate user tries to connect to the VPC, packets from the wrong network card simply cannot get through, achieving safe isolation and data access.

With the network architecture settled, how does it run? Throughout the design, we again take K8s's DSL design as an example. As mentioned earlier, many tasks are defined, and one workload may contain many small tasks. We therefore define a set of orchestration scripts similar to a DSL, such as job1, then job2, then job3. These orchestration scripts are submitted through entry points such as the SDK and console, received by the API Server, and scheduled by the Scheduler. The Scheduler connects to the Spark gateway to implement task management, status management, task distribution, and so on, and finally schedules internal K8s to launch the jobs for execution. Some full-load jobs run once, such as pulling a DB once; there are also resident streaming jobs, trigger-based asynchronous jobs, scheduled asynchronous jobs, and so on, with different forms but the same scheduling capability, so the system can be extended. Throughout the process, job status, periodic statistics, and so on are continuously fed back. In K8s, the K8s Master plays this role and also hosts the API Server and Scheduler roles; it is similar here, and the scheduler's HA mechanism is likewise implemented through a one-master-multiple-slave architecture.
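
The job1 → job2 → job3 orchestration idea can be sketched as a declarative workload spec plus a topological scheduler. The spec layout and job names below are invented for illustration (the actual Lakehouse DSL is not public); the scheduler simply runs each job only after its declared dependencies.

```python
from collections import deque

# Hypothetical orchestration spec: named jobs with dependencies.
workload = {
    "jobs": {
        "full_load":  {"type": "batch",     "after": []},
        "incr_sync":  {"type": "streaming", "after": ["full_load"]},
        "compaction": {"type": "async",     "after": ["full_load"]},
    }
}

def schedule_order(workload):
    """Topologically order jobs so each runs only after its dependencies."""
    jobs = workload["jobs"]
    indegree = {name: len(spec["after"]) for name, spec in jobs.items()}
    dependents = {name: [] for name in jobs}
    for name, spec in jobs.items():
        for dep in spec["after"]:
            dependents[dep].append(name)
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in dependents[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return order

order = schedule_order(workload)
assert order[0] == "full_load"                      # dependency runs first
assert set(order) == {"full_load", "incr_sync", "compaction"}
```

In the real system the "run" step would submit each job to the Spark gateway and feed status back; here only the ordering logic is shown.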

Why do we split a workload's user-facing tasks into N different jobs? If all the tasks ran in one process, the resource water level of the whole workload would fluctuate greatly, making elastic scheduling very difficult. Running everything at once is easy, but how many resources should be allocated? Spark is often not that flexible here, especially for asynchronous and scheduled tasks: they consume a lot of resources, and when the next run will come is hard to predict. It is like signal processing, where a Fourier transform splits a complex waveform into several simple waveforms and thereby simplifies the processing. We take the same intuition: using different jobs to execute the different roles within a workload makes elasticity easy to achieve. For example, a scheduled or temporarily triggered job can launch on demand, with resource consumption completely unrelated to the resident streaming tasks, so the stability of the streaming tasks and the ingestion latency are entirely unaffected. The thinking behind the design is to simplify a complex problem: from an elasticity perspective, the simpler the waveform, the better the elasticity and the easier the prediction.

Ingestion into the lake involves a lot of user account and password information, because not all cloud products use systems such as AWS IAM or Alibaba Cloud RAM for fully cloud-native resource permission control. Many products, including user-built systems and database systems, still authenticate with accounts and passwords, which means users must hand those connection credentials over to us. How can we manage them securely? We rely on two Alibaba Cloud systems: KMS, which uses hardware-level encryption to protect the user's data, and STS, a fully cloud-native three-party authentication capability, to achieve secure access to user data, especially the isolation and protection of sensitive data. This is our current system.

Another problem: different users are fully isolated through the mechanisms above, but a single user runs many tasks. The Lakehouse concept has a four-layer structure: one data set has multiple databases, a database has multiple tables, a table has different partitions, and a partition has different data files. Users have a sub-account system and many different jobs, so their operations on the data may affect one another.

For example, different ingestion tasks may all want to write to the same table. Task A is already running online, but another user configures Task B to write to the same location; this could wipe out all the data Task A has already written, which is very dangerous. Other users might delete jobs, thereby deleting the data of tasks running online, while still other tasks accessing that data cannot perceive the change. Other cloud services, other programs in the VPC, self-deployed services, and so on might also operate on the table and cause data problems. We therefore designed a complete set of mechanisms. One part is a table-level locking mechanism: if one task has claimed write permission on a table first, subsequent tasks are not allowed to write to it until that task's life cycle ends, so the data cannot be dirtied.
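
The table-level lock described above can be sketched as a tiny registry: the first task to claim a table holds write permission until it releases, and other tasks fail fast. This is a toy illustration of the mechanism, not Alibaba Cloud's implementation; all names are invented.

```python
class TableLockRegistry:
    """Toy table-level write lock for ingestion tasks."""

    def __init__(self):
        self._owners = {}   # table name -> owning task id

    def acquire(self, table, task_id):
        owner = self._owners.get(table)
        if owner is not None and owner != task_id:
            return False                 # another task is already writing
        self._owners[table] = task_id
        return True

    def release(self, table, task_id):
        # Only the owner may release, at the end of its life cycle.
        if self._owners.get(table) == task_id:
            del self._owners[table]

locks = TableLockRegistry()
assert locks.acquire("db.orders", "taskA") is True
assert locks.acquire("db.orders", "taskB") is False   # B rejected while A holds it
locks.release("db.orders", "taskA")
assert locks.acquire("db.orders", "taskB") is True    # free again after release
```

A production version would persist the lock with the table's metadata and add lease timeouts so a crashed task cannot hold the table forever.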

The other part builds permission verification for different programs on top of OSS's Bucket Policy capability: only Lakehouse tasks are allowed to write data, while other programs may read but not write. Data under the same account is meant to be shared, analyzed, and used across application scenarios, so it can be read, but it must not be polluted. These are the reliability measures we have taken.

We have talked a lot about the architecture; let's return to the overall data model. We see the whole process as row-centric (since the data warehouse still stores data as rows within the scope of a table): row data is used to build a unified model for ingestion, storage, analysis, and metadata. First there are the various data sources (text or binary; binlog is binary data, and all kinds of binary data can sit in Kafka). These are read through various Connectors and Readers (different systems use different names) and mapped into rows. Each row carries key descriptive information, such as source information and change type, plus a variable column set. The rows then go through a series of rule transformations, such as filtering out certain data, generating primary keys, defining versions, and type conversion; finally, through Hudi Payload encapsulation and conversion, metadata maintenance, file generation, and so on, they are written to the lake storage.
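
The row-centric pipeline can be sketched as follows: source records become rows carrying descriptive fields (source, change type) plus a variable column set, and a chain of transform rules filters, derives, and casts before the write step. Field names and rules here are illustrative, not a real API.

```python
def to_row(source, change_type, columns):
    """Map a source record into a row: descriptive fields + variable columns."""
    return {"_source": source, "_op": change_type, **columns}

def apply_rules(rows, rules):
    """Run each row through the rule chain; a rule returning None filters it out."""
    out = []
    for row in rows:
        for rule in rules:
            row = rule(row)
            if row is None:
                break
        if row is not None:
            out.append(row)
    return out

# Example rules: drop deletes, derive a primary key, cast a column type.
def drop_deletes(r):
    return None if r["_op"] == "DELETE" else r

def add_pk(r):
    r["_pk"] = f'{r["_source"]}:{r["id"]}'
    return r

def cast_amount(r):
    r["amount"] = float(r["amount"])
    return r

rows = [
    to_row("mysql.orders", "INSERT", {"id": 1, "amount": "9.5"}),
    to_row("mysql.orders", "DELETE", {"id": 2, "amount": "3.0"}),
]
result = apply_rules(rows, [drop_deletes, add_pk, cast_amount])
assert len(result) == 1
assert result[0]["_pk"] == "mysql.orders:1" and result[0]["amount"] == 9.5
```

In the real system the final step would hand each surviving row to the Hudi Payload layer for file generation; here only the mapping and rule stages are shown.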

By maintaining metadata, partitions, and other information in storage and connecting them to downstream computing and analysis, you can seamlessly see the metadata of all data stored in the lake and warehouse and seamlessly connect to different application scenarios.

Next, the common data source access forms we support. DB ingestion into the lake is the most common scenario. On Alibaba Cloud there are products such as RDS and PolarDB. Taking the MySQL engine as an example, there are typically master, slave, and offline databases, possibly with multiple access points, but the essence is the same. DB ingestion requires a full synchronization first, followed by incremental synchronization. To the user, DB ingestion is one clear workload; to the system, it must first run a full sync and then automatically switch to incremental sync, with a mechanism joining the two to ensure data correctness. The scheduling process obtains DB information through the unified management and control service, automatically selects the slave database or the instance with the least load, performs the full sync, writes it to the lake, and maintains the corresponding Watermark, recording when the full sync started, the replication lag between slave and master, and so on. Once the full sync completes, the incremental task begins, using binlog synchronization services such as DTS and backtracking from the recorded Watermark. Using Hudi's Upsert capability, data is merged by the user-defined PK and version according to a certain logic, guaranteeing eventual consistency of the data and correctness on the analysis side.
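
The PK-plus-version merge that makes full/incremental handoff safe can be shown in a few lines. This is a simplification of Hudi's payload semantics, written as a sketch: the record with the highest version wins per primary key, so replaying incremental changes from the Watermark, including records already covered by the full load, cannot corrupt the table.

```python
def upsert_merge(existing, incoming):
    """Merge incoming change records into the table by primary key,
    keeping the record with the highest version per key."""
    by_pk = {r["pk"]: r for r in existing}
    for rec in incoming:
        cur = by_pk.get(rec["pk"])
        if cur is None or rec["version"] >= cur["version"]:
            by_pk[rec["pk"]] = rec
    return sorted(by_pk.values(), key=lambda r: r["pk"])

full_load = [{"pk": 1, "version": 10, "v": "a"},
             {"pk": 2, "version": 10, "v": "b"}]
# Incremental replay from the Watermark may overlap the full load:
# newer versions win, stale ones are ignored, so the replay is idempotent.
incr = [{"pk": 2, "version": 12, "v": "b2"},
        {"pk": 2, "version": 9,  "v": "stale"},
        {"pk": 3, "version": 11, "v": "c"}]
merged = upsert_merge(full_load, incr)
assert [r["v"] for r in merged] == ["a", "b2", "c"]
```

Because the merge is deterministic given PK and version, it does not matter exactly where the incremental stream starts relative to the full sync, only that it starts at or before the Watermark.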

There is a lot to consider in Watermark maintenance. If the full sync fails and is retried, where should it resume? If the incremental sync fails, we must not only know how far it got previously, but also keep advancing the incremental checkpoint; we cannot roll back to the point before the full sync every time an incremental failure occurs, otherwise the subsequent data delay would be far too serious. Maintaining this information at the Lakehouse table level allows workload runs, restarts, and retries to be chained together automatically, transparently to the user.

The second form is ingestion of message-style products into the lake, where we have also done some technical exploration and business trials. Such data does not have as clear a schema as a DB. For example, in Alibaba Cloud's existing Kafka service the schema has only two fields, Key and Value: the Key identifies the message and the Value is user-defined, most often a JSON or binary string. First we must solve how to map it into rows, which involves a lot of logic, such as schema inference to obtain the original structure. JSON's nested format is easy to store but hard to analyze; it only becomes convenient to analyze once flattened into a wide table. So we do nested flattening, format expansion, and similar logic, then apply the core logic described above to finally write the files, merge the metadata, and so on. Metadata merging is needed because the number of columns at the source is uncertain: different rows may or may not carry a given column. For Hudi, this metadata must be maintained at the application layer; Schema Evolution in Lakehouse means merging schemas, handling column compatibility, automatically maintaining newly added columns, and so on.
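
The two steps described above, flattening nested JSON into wide-table rows and merging per-row schemas, can be sketched like this. The column-naming convention (underscore-joined paths) and the type inference are illustrative assumptions, not the production logic.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into a wide-table row, joining paths with '_'."""
    row = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            row.update(flatten(v, key + "_"))
        else:
            row[key] = v
    return row

def merge_schema(schema, row):
    """Schema evolution by merging: new columns are added automatically;
    existing columns keep their inferred type."""
    merged = dict(schema)
    for k, v in row.items():
        merged.setdefault(k, type(v).__name__)
    return merged

msg = {"user": {"id": 7, "geo": {"city": "HZ"}}, "event": "click"}
row = flatten(msg)
assert row == {"user_id": 7, "user_geo_city": "HZ", "event": "click"}

schema = {}
schema = merge_schema(schema, row)
schema = merge_schema(schema, {"event": "view", "referrer": "ad"})  # new column appears
assert set(schema) == {"user_id", "user_geo_city", "event", "referrer"}
```

Rows missing a column simply leave it null in the wide table; the merged schema only ever grows, which matches the compatible, additive evolution the text describes.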

We also have a solution based on Lindorm, our self-developed KV row store compatible with wide-table interfaces such as HBase and Cassandra. It holds many historical files and a large amount of log data. Through the internal LTS service, the full and incremental data are converted into columnar files in the Lakehouse mode to support analysis.

Both Kafka and SLS have the concepts of partitions and shards, which must scale out and in automatically as traffic fluctuates, so the consumer side must actively perceive these changes and keep consuming without affecting data correctness. In addition, this kind of data is append-only, which lets us make good use of Hudi's small-file merging capability to make downstream analysis simpler, faster, and more efficient.

3. Best Practices for Customers

That concludes the technical exploration; next, its application with customers. A cross-border e-commerce customer had the problem that DB data was hard to analyze. They run PolarDB and MongoDB systems and hoped to put all their data into the lake on OSS for near-real-time analysis. The known problem with federated analytics is that directly querying the source database puts great pressure on it; the better way is to ingest into an offline lake and analyze there. With the Lakehouse approach they built an offline lake warehouse, connected it to computing and analysis, and connected ETL cleanly, avoiding any impact on the online data. The same architecture underpins the whole data platform, so applications and analysis can flourish without affecting anything.

The difficulty this customer faced was that they have many databases, tables, and application cases. We made many optimizations on Hudi and contributed more than 20 patches to the community, including metadata integration and some Schema Evolution capabilities, which are also applied on the customer side.

Another customer needed near-real-time analysis of Kafka logs. Their original solution required many manual steps, including ingestion, data management, and small-file merging. With the Lakehouse solution, the customer's data is connected, automatically merged into the lake, and the metadata is maintained; customers can use it directly, with the internal wiring already in place.

There was also a small-file problem. For their scenario, we worked with the Hudi community on Clustering technology. Clustering automatically merges small files into large files, because large files are easier to analyze; moreover, during merging the data can be sorted by certain specific columns, which makes later access to those columns much faster.

4. Future Prospects

Finally, I would like to share our team’s thoughts on the future and how Lakehouse can be applied.

First, richer data sources entering the lake. The important value of Lakehouse lies in shielding the differences between data sources and breaking down data silos. Many systems on the cloud hold many types of data with great analytical value, and more of these sources need to be unified in the future. If only one DB or one Kafka is supported, customer value is not maximized; only when enough data is aggregated into one large offline lake warehouse, with the complexity shielded, does the value to users become obvious. Beyond cloud products, there are other ingestion forms such as proprietary clouds, self-built systems, and self-upload scenarios. The main thing is to strengthen the source layer.

Second, lower-cost, more reliable storage centered on data lifecycle management. Alibaba Cloud OSS has a wide range of billing methods, supporting multiple storage classes (standard, infrequent access, archive, and cold archive), with dozens of items in the billing logic that most people do not fully understand. But for users, cost is always central to the design, especially when building a massive offline lake warehouse, because the data volume keeps growing and the cost keeps rising.

I once met a customer who needed to store 30 years of data. Their business is stock analysis; they crawl all the data from exchanges and brokerages into a large lake warehouse. Because they analyze 30 years of history, cost optimization is critical: their original online system could not even hold a few months of data because the volume was too large. Analytical data tends to move from hot to cold, from relatively high-frequency to low-frequency access. By defining rules and logic, Lakehouse exploits these characteristics and shields users from the complex upkeep of deciding which directories go to cold storage and which stay hot, taking users a step further.
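
A lifecycle rule of the kind described can be sketched as a small age-based policy function that picks an OSS storage class per partition. The thresholds and class names below are invented for illustration; real tiering policies are user-defined.

```python
from datetime import date

def storage_class(partition_date, today, rules):
    """Pick a storage class for a partition from age-based rules
    (rules: list of (min_age_days, class), most aggressive wins)."""
    age = (today - partition_date).days
    for min_age, cls in sorted(rules, reverse=True):
        if age >= min_age:
            return cls
    return "Standard"

# Hypothetical policy: >30 days -> IA, >180 days -> Archive, >2 years -> ColdArchive.
rules = [(30, "IA"), (180, "Archive"), (365 * 2, "ColdArchive")]
today = date(2021, 12, 31)
assert storage_class(date(2021, 12, 20), today, rules) == "Standard"
assert storage_class(date(2021, 10, 1), today, rules) == "IA"
assert storage_class(date(2019, 1, 1), today, rules) == "ColdArchive"
```

A management workload would periodically evaluate this rule over all partitions and issue the corresponding OSS tiering operations, so users never maintain directory-by-directory decisions themselves.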

Third, stronger analysis capabilities. Besides the Clustering mentioned above, Hudi's analysis-acceleration toolbox also includes Compaction. Clustering merges small files. In a log scenario, for example, each batch written produces a file; these files are generally not large, but the smaller and more fragmented the files, the higher the access cost during analysis. To access a file you must authenticate, establish a connection, and read metadata; accessing one large file performs these steps once, while accessing many small files multiplies them, which is very costly. In the Append scenario, Clustering quickly merges small files into large files, avoiding the linear degradation of analysis performance caused by continuous writing and keeping analysis efficient.
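
The clustering planning step can be illustrated with a toy bin-packing planner: small files are grouped so that each group's merged size approaches a target file size, and during the rewrite each group's rows could additionally be sorted by chosen columns. This is a sketch of the idea, not Hudi's actual planner; names and sizes are invented.

```python
def plan_clustering(files, target_size, sort_key=None):
    """Group small files into clustering groups whose combined size
    reaches target_size; sort_key marks the column the rewrite would
    sort each group by (not applied here)."""
    groups, current, size = [], [], 0
    for f in sorted(files, key=lambda f: f["size"]):
        current.append(f["name"])
        size += f["size"]
        if size >= target_size:
            groups.append(current)
            current, size = [], 0
    if current:
        groups.append(current)
    return groups

# Eight 30 MB log files from eight write batches -> two ~120 MB target files.
files = [{"name": f"log_{i}.parquet", "size": 30} for i in range(8)]
groups = plan_clustering(files, target_size=120)
assert len(groups) == 2
assert sum(len(g) for g in groups) == 8   # every small file is covered
```

The per-file fixed costs the text mentions (authentication, connection, metadata reads) are paid once per group after clustering instead of once per small file, which is where the analysis speedup comes from.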

In Hudi, for a Merge On Read table, operations such as Delete and Update are quickly written to log files, and the data is merged at read time to form a complete logical view. The problem here is obvious: if there are 1,000 log files, every read must merge 1,000 times, and analysis performance degrades badly. Hudi's Compaction capability therefore merges the log files periodically. As mentioned earlier, if this were all done inside the same ingestion job, the computing overhead of file merging in particular would be very large, and such heavy loads would seriously hurt the latency of the ingestion link; write-latency guarantees must instead be achieved through asynchronous scheduling. And these processes are all elastic: whether 100 files or 10,000 files need merging, it can be done quickly and elastically without affecting latency, which is a big advantage.
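
The read-time merge of a Merge On Read table can be shown in miniature: the reader applies the accumulated Update/Delete log blocks on top of the base file to produce the complete logical view. This is a simplification for illustration; in Hudi, Compaction periodically folds these same log blocks back into the base file so readers no longer pay this cost.

```python
def mor_read(base_rows, log_blocks):
    """Build the logical view of a MOR table: base file + ordered log blocks."""
    view = {r["pk"]: r for r in base_rows}
    for block in log_blocks:                  # one block per log file
        for op in block:
            if op["op"] == "DELETE":
                view.pop(op["pk"], None)
            else:                             # UPDATE or INSERT
                view[op["pk"]] = {"pk": op["pk"], **op["data"]}
    return sorted(view.values(), key=lambda r: r["pk"])

base = [{"pk": 1, "v": "a"}, {"pk": 2, "v": "b"}]
logs = [
    [{"op": "UPDATE", "pk": 1, "data": {"v": "a2"}}],
    [{"op": "DELETE", "pk": 2}],
]
assert mor_read(base, logs) == [{"pk": 1, "v": "a2"}]
```

With 1,000 log blocks the inner loop runs 1,000 times on every read, which is exactly the degradation the text describes and what asynchronous Compaction eliminates.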

Fourth, more scenario-based applications. I personally think Lakehouse is still oriented towards the capabilities of the source layer, with a limited degree of aggregation. For higher aggregation levels and stronger real-time requirements, there are better real-time data warehouse options: DorisDB and ClickHouse, which are popular in the industry, have great advantages in real-time, high-frequency analysis, while a stack of Hudi, Lakehouse, and OSS has few advantages there. So the focus remains on building the capabilities of the source layer.

Originally, all scenarios involved near-real-time ingestion into the lake, but some users do not have such strong real-time requirements; periodic T+1 logical warehouse building is enough for them. They can use Hudi plus Lakehouse capabilities to query a day's worth of logical incremental data, write it to Hudi, maintain partitions, and get Schema Evolution along the way.
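A T+1 run of this kind typically pulls only rows updated since the last watermark. The sketch below is a hedged illustration of that pattern; the field name `updated_at` and the row shape are assumptions, not part of any Hudi API.

```python
# Hypothetical T+1 incremental pull: select rows updated after the last
# watermark, then advance the watermark to the newest row seen.
def incremental_batch(rows, watermark):
    batch = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in batch), default=watermark)
    return batch, new_watermark

rows = [
    {"id": 1, "updated_at": "2021-11-30"},
    {"id": 2, "updated_at": "2021-12-01"},
    {"id": 3, "updated_at": "2021-12-01"},
]
batch, wm = incremental_batch(rows, "2021-11-30")
# batch contains ids 2 and 3; the watermark advances to "2021-12-01"
```

ISO-formatted date strings compare correctly as plain strings, which keeps the example dependency-free; a real job would usually compare timestamps.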

In the early days, as data volume grew, customers used database and table sharding to split data logically. At analysis time they found there were too many databases and tables, making analysis and association difficult. Here we can build a multi-database, multi-table merge-and-warehouse-building capability that aggregates the shards into one table before analysis.
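The multi-table merge can be pictured as a union of shard tables into one logical table. This is an illustration of the idea, not Lakehouse's implementation; tagging each row with its source shard (a choice made here for traceability) is an assumption.

```python
# Union sharded tables (e.g. orders_00, orders_01, ...) into one logical
# table, recording each row's source shard in a tag column.
def merge_shards(shards: dict) -> list:
    merged = []
    for shard_name, rows in shards.items():
        for row in rows:
            merged.append({**row, "_source_table": shard_name})
    return merged

shards = {
    "orders_00": [{"order_id": 1}, {"order_id": 2}],
    "orders_01": [{"order_id": 3}],
}
unified = merge_shards(shards)  # 3 rows, each tagged with its shard
```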

Then there is cross-region fusion analysis. Many customers, especially those operating overseas, have raised this requirement. Some customers must run part of their business overseas to serve overseas users, particularly in the cross-border e-commerce scenario, while their procurement, warehousing, logistics, and distribution systems are all built in China. How should such data be analyzed together? OSS provides cross-region replication, but only at the data level, with no logic attached. Through Lakehouse we can build the logic layer: bring data from different regions together, aggregate it into the same region, and then provide unified SQL join, union, and other capabilities.

Finally, Hudi has TimeTravel and Incremental query capabilities. Building incremental ETL on top of them to clean different tables is generic to a certain extent and makes things easier for users. In the future, more scenario-based capabilities will be built in to make it easier for users to build and use lakehouses!
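Conceptually, both capabilities fall out of a commit timeline: each commit records what it wrote at an instant time. The sketch below is a simplified illustration of the two query modes, not Hudi's timeline format.

```python
# A toy commit timeline: (instant_time, records written by that commit).
commits = [
    ("t1", {1: "a", 2: "b"}),
    ("t2", {2: "b2"}),
    ("t3", {3: "c"}),
]

def as_of(commits, instant):
    """TimeTravel: reconstruct the table state as of a given instant."""
    view = {}
    for t, batch in commits:
        if t <= instant:
            view.update(batch)
    return view

def incremental(commits, begin_instant):
    """Incremental query: only records committed after `begin_instant`."""
    changed = {}
    for t, batch in commits:
        if t > begin_instant:
            changed.update(batch)
    return changed

as_of(commits, "t2")        # {1: "a", 2: "b2"}
incremental(commits, "t1")  # {2: "b2", 3: "c"}
```

An incremental ETL job is then just `incremental(...)` from the last processed instant, which is why the pattern generalizes across tables.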
