「Product News」Dataphin's Stream and Batch Integrated Real-Time R&D Explained

Background

Every time the opening bell of the Double 11 global shopping festival rings, tens of millions of users flock to Tmall and Taobao. Behind the smooth shopping experience is a technical foundation built by Alibaba engineers, which absorbs the data peak that Double 11 brings every year. Between November 1 and 0:00 on November 12, 2020, the cumulative transaction volume of Tmall's "Double 11" reached 498.2 billion yuan, and the number of logistics orders reached 2.321 billion. None of this would be possible without real-time computing technology.

As an enterprise-level intelligent data construction and management product, Dataphin provides full-link real-time R&D capabilities and has supported Alibaba Group's real-time computing needs for Tmall Double 11 since 2019. This article introduces Dataphin's real-time computing capabilities.

Traditional data warehouse architecture

When building a data warehouse, teams generally build an offline data warehouse first and develop applications around offline data. Later, as the business grows or the user experience needs improving, they add a real-time computing link to make the data more timely.

In this process, similar code inevitably gets written twice, which causes problems such as inconsistent metric definitions between real-time and offline results and higher maintenance costs.

The traditional data warehouse architecture separates stream and batch processing in both storage and computing, which leads to the following problems:

Efficiency issues: The underlying data models for stream and batch are inconsistent, so the application layer needs a large amount of splicing logic (year-on-year and month-on-month comparisons, secondary processing, and so on). Construction is inefficient and error-prone.
Quality issues: One piece of business logic requires two engines and two sets of code. SQL logic cannot be reused, so data consistency and quality are hard to guarantee.
Cost issues:
Stream and batch storage systems are isolated (each serves different write scenarios) and expose different data services, so maintenance costs are high. Data synchronization tasks must be built manually, which raises development costs, and the data is stored twice, which raises storage costs.
Batch and stream processing clusters cannot shift peaks against each other, so resource utilization is low.

Dataphin's advantages in stream and batch integration

To solve the separation of stream and batch in the storage and computing layers of the traditional data warehouse architecture, the idea of "stream and batch integration" was proposed:

Stream and batch storage are transparent to the user and query logic is identical, which greatly reduces access costs on the application side and gives unified support for point queries and OLAP analysis.
Storage is unified at the service layer, so no manual synchronization is needed and data is not stored twice.
One set of code serves two computing modes, with unified logic and flexible switching, which greatly improves R&D efficiency.
Stream and batch computing resources are mixed, which improves resource utilization.

Dataphin builds more platform capabilities on top of Flink's stream and batch integration, such as data source management, metadata management, asset lineage, asset quality control, precompilation, and debugging.

Development and production isolation: The development environment is isolated from the production environment, so business code under development does not interfere with production.
Metadata management: Every system component, including data sources, meta tables, and UDXs, has permission control, and sensitive configuration information is encrypted. Access to sensitive data source fields can be requested through subscription. Meta tables, functions, resources, and so on are managed in a unitized and visualized manner, and cross-project calls with field-level authentication are supported, which lets users focus on business logic.
Stream and batch integration: The stream and batch storage layers are managed in a unified way, which gives a unified model layer and unified stream and batch code. Separate stream and batch configurations then produce scheduling instances that are independent yet coordinated.
R&D efficiency improvement:
Precompilation, syntax verification, permission verification, and field lineage extraction are provided.
Containerized debugging supports uploading custom data or directly consuming real production data to observe how a job runs and to check the output of each node. Metadata retrieval and visual exploration of job dependencies and field lineage are also supported.

Stability and quality assurance:

Traffic thresholds can be set to prevent excessive competition for computing resources and to avoid overloading downstream systems.
Real-time meta table quality monitoring is supported, with configurable statistical trend monitoring, real-time multi-link comparison, and real-time/offline data verification.

Development and production isolation

Dataphin supports projects that isolate development from production, and lets you configure a data source for each environment. In development mode, tasks automatically use the development data source and the physical tables in the development environment; when a task is released to production, Dataphin automatically switches to the production data source and the physical tables in the production environment. The switch is fully automatic and requires no manual changes to code or configuration.
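
The automatic switch between environments can be pictured as a lookup keyed by environment. The following is a minimal Python sketch of that idea, not Dataphin's implementation; the names DATA_SOURCES and resolve_table, and the example source and table names, are hypothetical.

```python
# Hypothetical illustration: one logical table name, two physical bindings.
DATA_SOURCES = {
    "orders": {
        "dev": {"source": "kafka_dev", "table": "orders_dev"},
        "prod": {"source": "kafka_prod", "table": "orders"},
    },
}


def resolve_table(logical_name: str, env: str) -> str:
    """Return the physical 'source.table' for a logical table in one environment."""
    binding = DATA_SOURCES[logical_name][env]
    return f"{binding['source']}.{binding['table']}"


# The task code only ever names the logical table "orders".
print(resolve_table("orders", "dev"))
print(resolve_table("orders", "prod"))
```

Because the task code refers only to the logical name, releasing it to production changes the binding and not the code.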

Metadata Management

Dataphin introduced the concepts of real-time meta tables and mirror tables, which bring the tables used in real-time R&D under unified, platform-based management as assets. This simplifies development and improves R&D efficiency and experience.

Traditional real-time development tools require users to write CREATE TABLE statements over and over and to perform tedious mapping of input and output tables. Real-time meta tables instead build and manage every data table used by real-time development tasks in one place, and maintain all meta tables and their schema information centrally. Developers no longer need to write DDL statements repeatedly or map input, output, and dimension tables by hand. In a pure code development mode, a simple SET statement and a permission application are enough to reference a table, query it directly, or write data to it. A table is built once and referenced many times, which greatly improves R&D efficiency and experience.

As the name implies, a mirror table maintains the mapping between the fields of an offline table and those of a real-time table. After a mirror table is created and published, its fields can be used in a stream and batch integrated Flink task. Dataphin automatically maps them to the stream table and the batch table at compile time, so that one set of code drives both computations and the code logic and metric definitions stay consistent.
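
The field mapping that a mirror table maintains can be sketched in a few lines of Python. This is only an illustration of the concept under simplifying assumptions; the MirrorTable class, its resolve method, and the table and column names are all hypothetical and do not reflect Dataphin's internal design.

```python
from dataclasses import dataclass


@dataclass
class MirrorTable:
    """Hypothetical model: one logical schema backed by a stream table and a batch table."""
    stream_table: str
    batch_table: str
    columns: dict  # mirror field -> (stream column, batch column)

    def resolve(self, field: str, mode: str) -> str:
        """Translate a mirror-table field into a qualified column for the given mode."""
        if mode not in ("stream", "batch"):
            raise ValueError(f"unknown mode: {mode}")
        stream_col, batch_col = self.columns[field]
        if mode == "stream":
            return f"{self.stream_table}.{stream_col}"
        return f"{self.batch_table}.{batch_col}"


orders = MirrorTable(
    stream_table="rt_orders",
    batch_table="ods_orders",
    columns={"amount": ("pay_amt", "pay_amount")},
)
print(orders.resolve("amount", "stream"))
print(orders.resolve("amount", "batch"))
```

The business code refers only to the mirror field amount; the mode decides which physical column it becomes.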

Stream and batch code tasks

In addition to real-time meta tables and mirror tables, Dataphin supports stream and batch integrated tasks that use Flink as the unified computing engine. Stream and batch settings can be configured on the same code, and instances in different modes are generated from that code. Dataphin also provides several ways to handle code that must differ between stream and batch.

Mirror tables are widely used in stream and batch integrated tasks, and each mirror table is translated into the corresponding stream table or batch table when it is used. Stream and batch tables can differ: their data sources may be different, which leads to different keys in the WITH parameters, and some settings such as batchSize may differ as well. To handle these differences, tableHints can target the stream table or the batch table separately. The syntax is as follows:

set project.table.${mode}.${key}
-- ${mode}: stream for a stream task, batch for a batch task

For example, to set the start and stop time of a batch task:

set project.table.batch.startTime='2020-11-11 00:00:00';
set project.table.batch.endTime='2020-11-12 00:00:00';
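
To make the per-mode behavior concrete, the Python sketch below parses statements of this shape and keeps only the hints that apply to one mode. How Dataphin itself parses these statements is not described in this article; the regular expression and the hints_for_mode function are assumptions made purely for illustration.

```python
import re

# Matches: set <project>.<table>.<mode>.<key>='<value>';
SET_PATTERN = re.compile(
    r"set\s+(?P<project>\w+)\.(?P<table>\w+)\."
    r"(?P<mode>stream|batch)\.(?P<key>\w+)"
    r"\s*=\s*'(?P<value>[^']*)'\s*;",
    re.IGNORECASE,
)


def hints_for_mode(script: str, mode: str) -> dict:
    """Collect the key/value hints that apply to one mode ('stream' or 'batch')."""
    hints = {}
    for match in SET_PATTERN.finditer(script):
        if match.group("mode").lower() == mode:
            hints[match.group("key")] = match.group("value")
    return hints


script = """
set project.table.batch.startTime='2020-11-11 00:00:00';
set project.table.batch.endTime='2020-11-12 00:00:00';
"""
print(hints_for_mode(script, "batch"))
print(hints_for_mode(script, "stream"))
```

In this sketch the batch instance sees both time bounds, while the stream instance generated from the same code sees none of them.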

The second way is to use task parameters: the same parameter in the code is assigned different values in Dataphin's task configuration for the real-time mode and the offline mode.

Real-time quality monitoring

Dataphin's real-time data quality features are aimed mainly at developers. They analyze and verify the quality of the real-time data tables produced on the platform to ensure that the data is ultimately valid and accurate. Dataphin supports statistical trend monitoring, real-time multi-link comparison, and real-time/offline data verification.

Statistical trend monitoring: Trend monitoring captures abnormal fluctuations based on how the data trend changes, combined with expert experience. For example, a sudden sharp rise in real-time GMV may indicate an anomaly.
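
One simple way to express such a rule is to compare the latest value with the average of a recent window and flag it when the relative change exceeds a threshold chosen from experience. The Python sketch below assumes exactly that rule; the function name, the threshold, and the sample numbers are hypothetical, and the article does not state which algorithm Dataphin uses.

```python
from statistics import mean


def is_trend_anomaly(history, latest, max_ratio=0.5):
    """Flag `latest` when it deviates from the recent average by more than max_ratio."""
    baseline = mean(history)
    if baseline == 0:
        # No meaningful ratio exists; treat any non-zero value as a change.
        return latest != 0
    return abs(latest - baseline) / abs(baseline) > max_ratio


recent_gmv_per_minute = [100, 104, 98, 102]  # made-up sample values
print(is_trend_anomaly(recent_gmv_per_minute, 110))
print(is_trend_anomaly(recent_gmv_per_minute, 300))
```

The max_ratio parameter is where expert experience enters: a metric that normally swings widely needs a looser threshold than a stable one.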

Real-time multi-link comparison: In real-time computing scenarios, recovering data is expensive and a job cannot quickly recompute from the starting point, so multiple computing links are run in parallel. When one link computes abnormally, traffic is switched to another link, either automatically or manually. This strategy trades resources for stability and is often used for major assurance events. For example, multi-link assurance is used for the big screens during Double 11 every year.
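
The switching decision can be sketched as choosing the highest-priority link whose output is still fresh. The Python example below is a simplified illustration; the choose_link function, the freshness criterion, and the link names are assumptions and do not describe how Dataphin detects anomalies.

```python
def choose_link(last_output_ts, now, max_lag_seconds=60):
    """Pick the first link, in priority order, whose latest output is fresh enough.

    last_output_ts maps link name to the timestamp (in seconds) of its latest
    output; dict insertion order is used as the priority order.
    """
    for name, ts in last_output_ts.items():
        if now - ts <= max_lag_seconds:
            return name
    return None  # every link is stale; escalate to a human


links = {"primary": 1000, "backup": 1290}  # hypothetical timestamps
print(choose_link(links, now=1300))
```

Here the primary link has produced nothing for 300 seconds, so the sketch falls back to the backup link.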

Real-time/offline verification: Comparing real-time data with offline data is a common way to safeguard real-time data. Real-time computing runs continuously for long periods and is strongly affected by resources and source data, whereas offline data is easier to work with in terms of logic and data reusability. To ensure the accuracy of real-time data, it is therefore often compared against offline data. For example, offline data is used to verify real-time data before Double 11 every year.
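
A verification of this kind boils down to comparing the two result sets key by key within a tolerance. The following Python sketch assumes both results are available as dictionaries keyed by date; the find_mismatches function, the tolerance, and the sample values are hypothetical.

```python
def find_mismatches(realtime, offline, tolerance=0.001):
    """Return keys whose real-time value differs from the offline value.

    A key is reported when it is missing from the real-time result or when
    the relative difference exceeds `tolerance`. Offline data is the reference.
    """
    mismatches = {}
    for key, expected in offline.items():
        actual = realtime.get(key)
        if actual is None:
            mismatches[key] = (None, expected)
            continue
        scale = max(abs(expected), 1e-12)  # avoid division by zero
        if abs(actual - expected) / scale > tolerance:
            mismatches[key] = (actual, expected)
    return mismatches


realtime = {"day1": 1000.0, "day2": 2050.0}  # made-up sample values
offline = {"day1": 1000.5, "day2": 2000.0, "day3": 10.0}
print(find_mismatches(realtime, offline))
```

The small difference on day1 passes, while day2 (2.5% off) and day3 (missing from the real-time result) are reported.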

Dataphin behind the big screen on Double 11

Returning to the Tmall Double 11 event from the beginning of the article: now that we have covered the capabilities of the Dataphin platform, let's break down why Dataphin can support the real-time data screen for Tmall Double 11.

Fast

Dataphin provides one-stop real-time services across the whole chain of R&D, debugging, testing, and operation and maintenance, which greatly lowers the barrier to development.
It also provides unified metadata management. Metadata only needs to be initialized once, so a table is created once and referenced many times, which lets developers focus on business logic and greatly improves R&D efficiency and experience.
In addition, engineers with data R&D experience will recognize this situation: many metric definitions are strikingly similar, and some differ only in their input and output tables. A typical case is the primary and backup links. For this scenario, Dataphin provides development templates. The shared logic is encapsulated in the template and the differences are expressed through template parameters. A new task only needs to reference the template and set its parameters, which greatly improves R&D efficiency and reduces the cost of maintaining metric definitions.
With these capabilities, even though many business teams were involved and demand surged for the Double 11 big screen, only two people were needed to support hundreds of requirements.
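
The template idea for primary and backup links can be illustrated with Python's standard string.Template. This is a sketch of the general pattern only; the SQL text, the parameter names, and the table names are hypothetical and are not Dataphin's template syntax.

```python
from string import Template

# Shared logic lives in the template; only the tables differ per link.
JOB_TEMPLATE = Template(
    "INSERT INTO $sink_table\n"
    "SELECT seller_id, SUM(amount) AS gmv\n"
    "FROM $source_table\n"
    "GROUP BY seller_id"
)


def render_job(source_table: str, sink_table: str) -> str:
    """Fill the template parameters to produce the code for one link."""
    return JOB_TEMPLATE.substitute(source_table=source_table, sink_table=sink_table)


primary = render_job("orders_primary", "gmv_primary")
backup = render_job("orders_backup", "gmv_backup")
print(primary)
print(backup)
```

A change to the aggregation logic is made once in the template and reaches both links, which is how a template keeps metric definitions from drifting apart.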

Stable

Dataphin provides comprehensive task monitoring and data quality monitoring to keep tasks stable and surface problems quickly. The template-based primary and backup links can switch within seconds when an anomaly occurs, which contains the damage quickly. Real-time task lineage helps locate the root cause, and debugging, testing, and fine-grained resource configuration make it possible to verify and fix a problem quickly. The result is detection within 1 minute, root-cause location within 5 minutes, and resolution within 10 minutes.

Accurate

Stream and batch integration makes it possible to unify the code, the metric definitions, the storage, and the data service interface, which improves R&D efficiency while ensuring data consistency.

Future plans

The upcoming version adapted to Flink VVP (Ververica Platform) will support the new VVR engine, and support for the open source Flink engine is planned so that more deployment environments can be covered. Dataphin will also keep improving its real-time R&D capabilities and experience, helping enterprises lower the barrier to real-time R&D, explore more scenarios, and realize the business value of real-time data.
