Background
Every time the Double 11 global shopping festival kicks off, tens of millions of users flock to Tmall and Taobao. Behind the smooth shopping experience is the technical foundation built by Alibaba engineers, which absorbs the data peak that Double 11 brings every year. From November 1 to 0:00 on November 12, 2020, the cumulative transaction volume of Tmall's "Double 11" reached 498.2 billion yuan, and the total number of logistics orders reached 2.321 billion. None of this would be possible without real-time computing technology. As an enterprise-level intelligent data construction and management product, Dataphin has full-link real-time R&D capabilities and has supported the group's real-time computing needs for Tmall Double 11 since 2019. This article introduces Dataphin's real-time computing capabilities.

Traditional data warehouse architecture
When building a data warehouse, teams generally build an offline data warehouse first and build applications around offline data. Then, as the business develops or the experience is optimized, they add a real-time computing link to improve data timeliness. This inevitably means writing similar code twice, which causes problems such as inconsistent metric definitions (calibers) between real-time and offline results and higher maintenance costs. The traditional data warehouse architecture separates stream and batch processing in both storage and computing, which leads to the following problems:

Efficiency issues: The underlying data models of the stream and batch layers are inconsistent, so the application layer accumulates a large amount of splicing logic (year-on-year and month-on-month comparisons, secondary processing, and so on), which makes construction slow and error-prone.
Dataphin's advantages in stream and batch integration
To solve the separation of stream and batch storage and computing in the traditional data warehouse architecture, the idea of "stream and batch integration" was proposed:

Stream and batch storage are transparent to the user, query logic is fully consistent, application-side access costs drop sharply, and point queries and OLAP analysis are supported in a unified way.
Storage is unified at the service layer, so no manual synchronization is needed and data is not stored twice.
One set of code serves two computing modes, with unified logic and flexible switching, which greatly improves R&D efficiency.
Stream and batch computing resources are shared, which improves resource utilization.

Dataphin provides more platform capabilities on top of Flink's stream and batch integration, such as data source management, metadata management, asset lineage, asset quality control, precompilation, and debugging:

Development and production isolation: The development environment is isolated from the production environment so that business code under development does not interfere with production.
Stability and quality assurance: Traffic thresholds can be set to prevent excessive competition for computing resources and to avoid overloading downstream systems. Real-time meta-table quality monitoring is supported, with configurable statistical trend monitoring, real-time multi-link comparison, and real-time offline data verification.

Development and production isolation
Dataphin supports projects with development and production isolation, and supports data source configuration for both the development and production environments.
As a result, in development mode a task automatically uses the development data source and the physical tables in the development environment; when it is released to production, Dataphin automatically switches to the production data source and the physical tables in the production environment. The switch is fully automated and requires no manual changes to code or configuration.

Metadata management
Dataphin introduced the concepts of real-time meta-tables and mirror tables, which manage the tables used in real-time R&D as platform assets, simplifying development and improving R&D efficiency and experience.

Traditional real-time development tools require users to write CREATE TABLE statements repeatedly and to perform tedious input and output table mapping. Real-time meta-tables build and manage all data tables used in real-time development tasks in one place, together with their schema information. Developers no longer need to rewrite DDL statements or perform complicated input, output, and dimension table mapping. In a pure code development mode, with simple SET statements and permission applications, they can reference table data, query it directly, or write to it. A table is built once and referenced many times, which greatly improves R&D efficiency and experience.

As the name implies, a mirror table maintains the mapping between the fields of an offline table and those of a real-time table. After the mirror table is created and published, its fields can be used in an integrated stream and batch Flink task.
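As a purely conceptual illustration of such a field mapping, the Python sketch below models a mirror table as a lookup from mirror field names to stream and batch column names and renders a query for one mode. It is not Dataphin's API or syntax; every table, column, and function name here is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorField:
    """One mirror-table field and the physical columns it stands for."""
    stream_column: str  # column name in the real-time (stream) table
    batch_column: str   # column name in the offline (batch) table


# Hypothetical mirror table: all names are assumptions for illustration.
ORDER_MIRROR = {
    "order_id": MirrorField("order_id", "order_id"),
    "pay_amount": MirrorField("pay_amt", "pay_amount_yuan"),
}


def resolve(field: str, mode: str) -> str:
    """Return the physical column behind a mirror field for the given mode."""
    if mode not in ("stream", "batch"):
        raise ValueError(f"unknown mode: {mode}")
    mapping = ORDER_MIRROR[field]
    return mapping.stream_column if mode == "stream" else mapping.batch_column


def render_select(fields: list[str], mode: str, tables: dict[str, str]) -> str:
    """Render one SELECT against the physical table for the given mode."""
    columns = ", ".join(f"{resolve(f, mode)} AS {f}" for f in fields)
    return f"SELECT {columns} FROM {tables[mode]}"
```

The same list of mirror fields yields a different physical query per mode, which is the sense in which one piece of logic can serve both stream and batch runs.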
Dataphin automatically maps the mirror table's fields to the stream table and the batch table during compilation, so that one set of code serves both computing modes and the code logic and metric definitions stay consistent.

Stream and batch code tasks
In addition to real-time meta-tables and mirror tables, Dataphin supports integrated stream and batch tasks, using Flink as the unified stream and batch computing engine. Stream and batch settings can be configured on the same code, and instances in different modes are generated from that code. Dataphin also provides different ways to handle code that must differ between stream and batch.

Mirror tables are widely used in stream-batch integration tasks and are translated into the corresponding stream or batch tables when they are finally used. Stream and batch tables can differ: their data sources may be different, which leads to different keys in the WITH parameters, and some settings, such as batchSize, may differ as well. To accommodate this, the first way is to use tableHints that target the stream table or the batch table, in the following form:

set project.table.${mode}.${key}

where ${mode} is `stream` for a stream task and `batch` for a batch task. For example, to set the start and end time of a batch task:

set project.table.batch.startTime='2020-11-11 00:00:00';
set project.table.batch.endTime='2020-11-12 00:00:00';

The second way is to use task parameters: the parameters in Dataphin's task configuration are replaced separately for real-time and offline modes.

Real-time quality monitoring
Dataphin's real-time data quality feature is aimed mainly at developers. It analyzes and verifies the quality of the real-time data tables that are produced, to ensure that the final data is valid and accurate. Dataphin supports statistical trend monitoring, real-time multi-link comparison, and real-time offline data verification.
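To give a rough feel for two of the checks just named, the Python sketch below compares a metric against its previous value (a very simple trend check) and against an offline total (a real-time versus offline check). This is a minimal illustration under assumed rules, not Dataphin's implementation: the function names, the relative-change formula, and the thresholds are all hypothetical.

```python
def trend_alert(previous: float, current: float, max_change_ratio: float) -> bool:
    """Return True when the relative change since the previous value
    exceeds the allowed ratio (assumed rule, for illustration only)."""
    if previous == 0:
        # No baseline to compare against: any non-zero value counts as a jump.
        return current != 0
    return abs(current - previous) / abs(previous) > max_change_ratio


def offline_mismatch(realtime_total: float, offline_total: float,
                     tolerance_ratio: float) -> bool:
    """Return True when the real-time total deviates from the offline
    total by more than the tolerance (offline total is the reference)."""
    if offline_total == 0:
        return realtime_total != 0
    return abs(realtime_total - offline_total) / abs(offline_total) > tolerance_ratio
```

In practice the thresholds would come from expert experience with the metric in question; the values used to exercise these functions carry no special meaning.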
Statistical trend monitoring: Trend monitoring captures abnormal fluctuations based on changes in data trends combined with expert experience. For example, a sudden sharp increase in real-time GMV is a sign of a possible anomaly.
Real-time multi-link comparison: In real-time computing scenarios, recovering data is expensive and it is impossible to recompute quickly from the starting point, so multiple computing links are run. When one link computes abnormally, the computing link is switched automatically or manually. This strategy trades resources for stability and is typically used for critical services that need strong guarantees. For example, the large screens for Double 11 use multi-link protection every year.
Real-time offline verification: Real-time offline verification is a common measure for safeguarding real-time data. Because real-time computing runs continuously for long periods, it is strongly affected by resource fluctuations and source data. Offline data is easier to handle in terms of logic and data reusability. Therefore, to ensure the accuracy of real-time data, it is often compared against offline data. For example, every year before Double 11, offline data is used to verify real-time data.

Dataphin behind the big screen on Double 11
Returning to the Tmall Double 11 event from the beginning of the article: now that we have covered the capabilities of the Dataphin platform, let's break down why Dataphin can support the real-time data screen for Tmall Double 11.
Fast: Dataphin provides one-stop real-time services across the entire chain of R&D, debugging, testing, and operation and maintenance, greatly lowering the development threshold for users.
Stable: Dataphin provides all-round task monitoring and data quality monitoring to keep tasks stable and surface problems quickly. Template-based active-standby multi-link setups can switch within seconds when an anomaly occurs, limiting the damage quickly. Real-time task lineage helps locate the root cause, and debugging, testing, and fine-grained resource configuration allow fixes to be verified and applied quickly, achieving discovery within 1 minute, location within 5 minutes, and resolution within 10 minutes.
Accurate: Stream and batch integration delivers unified code, unified metric definitions, unified storage, and a unified data service interface, improving R&D efficiency while ensuring data consistency.

Future plans
The upcoming version adapted to Flink VVP (Ververica Platform) will support the new VVR engine, and support for the open source Flink engine is planned so that more deployment environments can be covered. Dataphin will also continue to improve the capabilities and experience of real-time R&D, helping enterprises lower the threshold for real-time R&D, explore more scenarios, and obtain the business value that real-time data brings.