This article is compiled from the talk "Kuaishou's Scenario-Based Practice of Building a Real-Time Data Warehouse Based on Flink", shared by Li Tianshuo, a data technology expert at Kuaishou, at the Flink Meetup in Beijing on May 22. The content covers four parts:

1. Kuaishou's real-time computing scenarios
2. Kuaishou's real-time data warehouse architecture and safeguard measures
3. Problems in Kuaishou's scenarios and their solutions
4. Future planning

1. Kuaishou Real-Time Computing Scenarios

The real-time computing scenarios in Kuaishou's business fall into four parts:

- Company-level core data: this includes the company's overall business performance, real-time core daily reports, and mobile data. The team maintains overall performance indicators for the company as well as for each business line, such as video and live streaming, and each line has its own core real-time dashboard.
- Real-time indicators for large-scale events: the core deliverable is the real-time large screen. For Kuaishou's Spring Festival Gala event, for example, we have an overall large screen to view the overall status of the event. A large-scale event is divided into N different modules, and we provide different real-time data dashboards for the different gameplay in each module.
- Operational data: this mainly covers two aspects, creators and content. On the operational side, for example, when launching an event with a top creator, operators want to see information such as the real-time status of the live broadcast room and how much traffic the room pulls in for the overall market. For this scenario we build multi-dimensional data for various real-time large screens, as well as market-level data. This also includes support for operational strategies: for example, we may discover hot content, hot creators, and current hot trends in real time, and strategies are derived from these trends, which is another support capability we need to provide. Finally, it includes C-end data display: Kuaishou now has a creator center and an anchor center, with pages such as the anchor's post-broadcast page, and part of the real-time data on those pages is also produced by us.
- Real-time features: these include real-time features for search recommendation and advertising.

2. Kuaishou's Real-Time Data Warehouse Architecture and Safeguard Measures

1. Objectives and Difficulties

1.1 Objectives

First of all, since we are building a data warehouse, we expect every real-time indicator to have a corresponding offline indicator. We require the overall difference between real-time and offline indicators to stay within 1%, which is the minimum standard.

1.2 Difficulties

The first difficulty is the sheer volume of data. The overall daily ingress traffic is on the order of trillions of records, and during events such as the Spring Festival Gala the peak QPS can reach 100 million.

2. Real-Time Data Warehouse: Layered Model

Based on the difficulties above, let's look at the data warehouse architecture. As shown in the figure, the bottom layer has three different data sources: client logs, server logs, and Binlog logs. The first step is business data integration, which is equivalent to bringing the business data into the warehouse.
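As a hedged illustration of that ingestion step (not shown in the original talk), the following is a minimal Flink SQL sketch of how one of these sources, a client-log stream in Kafka, might be declared as a source table. The topic, field names, and connector options are hypothetical placeholders, not Kuaishou's actual schema.

```sql
-- Minimal sketch, assuming a Kafka client-log stream; topic and field
-- names are hypothetical placeholders, not Kuaishou's actual schema.
CREATE TABLE ods_client_log (
    did        STRING,                          -- device id
    page       STRING,                          -- page identifier
    `event`    STRING,                          -- event type, e.g. 'show' / 'click'
    event_time TIMESTAMP(3),                    -- event time carried by the log itself
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'ods_client_log',                 -- hypothetical topic name
    'properties.bootstrap.servers' = '<broker-list>',
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);
```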
3. Real-Time Data Warehouse: Safeguard Measures

Based on the layered model above, let's look at the overall safeguard measures. The guarantees are divided into three parts: quality assurance, timeliness assurance, and stability assurance.

Let's first look at the quality assurance in the blue part. At the data-source stage we monitor the sources for out-of-order data, based on our own collection SDK, and calibrate the data sources for consistency with the offline data. The computation process has three stages: the R&D stage, the online stage, and the service stage. In the R&D stage we provide a standardized model, build benchmarks on top of it, and compare and verify against offline data to ensure consistent quality; the online stage focuses more on service monitoring and indicator monitoring; in the service stage, if an abnormal situation occurs, we first restart the job from Flink state, and if results still do not meet expectations, we repair the data offline as a whole.

3. Kuaishou Scenario Problems and Solutions

1. PV/UV Standardization

1.1 Scenario

The first problem is PV/UV standardization. Here are three screenshots: the first is a warm-up scene for the Spring Festival Gala event, which is essentially one kind of gameplay; the second and third are screenshots of the red-envelope distribution activity and the live broadcast room on the day of the Spring Festival Gala. During the activity we found that 60-70% of the demand was to calculate information on a page, such as how many people visited the page, or how many people clicked into it.

1.2 Solution

Abstracting this scenario, the SQL (shown in the figure) is simple: take a table, apply a filter condition, aggregate by the dimension level, and produce Count or Sum results.

Based on this scenario, our initial solution is shown on the right side of the figure above. We used the Early Fire mechanism of Flink SQL: read data from the source, then bucket the data by DID (for example, the purple part is bucketed first). The reason for bucketing first is to prevent hot spots on a particular DID. After bucketing there is a Local Window Agg, which adds up data of the same type within each bucket. After the Local Window Agg, a Global Window Agg merges the buckets by dimension, which is equivalent to computing the final result per dimension. The Early Fire mechanism opens a day-level window in the Local Window Agg and emits output once per minute.

We encountered some problems in this process, as shown in the lower-left corner of the figure. The code works fine under normal conditions, but when the overall data is delayed or historical data is backfilled, with Early Fire emitting once a minute, the data volume during backfill is large, so the backfill may start at 14:00 and jump straight to the data at 14:02, losing the 14:01 point. What happens after it is lost? In this scenario, the curve at the top of the figure is the result of backfilling historical data with Early Fire: the horizontal axis is minutes and the vertical axis is the page UV up to the current moment. Some points are flat, meaning there is no data result, followed by a sharp jump, then flat again, then another sharp jump, whereas the expected result is the smooth curve at the bottom of the figure.

To solve this problem we adopted the Cumulate Window solution, which is also included in Flink 1.13 and follows the same principle. A large day-level window is opened over the data, with small minute-level windows under it. The data falls into the minute-level window according to its own Row Time, and when the watermark advances past the window's event time, the window emits its result downstream. This solves the backfill problem: the data falls into the correct window, the watermark advances, and the output is triggered after the window ends.
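As a minimal sketch of this approach (not the exact SQL from the talk), the Flink 1.13 CUMULATE window TVF can express the "day-level window, emitted minute by minute" pattern. The table and column names below follow the hypothetical source table sketched earlier and are assumptions for illustration.

```sql
-- Minimal sketch of the Cumulate Window approach, assuming the hypothetical
-- ods_client_log table above. It emits cumulative per-dimension UV/PV every
-- minute within a day-level window, driven by the watermark over event time
-- rather than by processing-time early fire, so backfilled data still lands
-- in the correct minute.
SELECT
    window_start,
    window_end,
    page                 AS dim,
    COUNT(DISTINCT did)  AS uv,
    COUNT(1)             AS pv
FROM TABLE(
    CUMULATE(
        TABLE ods_client_log,
        DESCRIPTOR(event_time),
        INTERVAL '1' MINUTE,     -- step: emit once per minute
        INTERVAL '1' DAY         -- max size: the day-level window
    ))
WHERE `event` = 'show'           -- the filter condition from the abstraction
GROUP BY window_start, window_end, page;
```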
2. DAU Calculation

2.1 Background

Here is an introduction to DAU calculation. We monitor active devices, new devices, and returning devices across the entire market. Active devices are devices that visited that day.

Let's look at how this logic is calculated offline. First we compute the active devices, merge them together, and deduplicate by day under one dimension. Then we join the dimension table, which contains each device's first and last access time, that is, when the device was first seen and last seen as of yesterday. With this information we can run the logic, and we find that new devices and returning devices are in fact sub-labels of active devices: new devices are derived with one piece of logic, and returning devices with logic based on a 30-day window.

Based on this scheme, can we simply write one SQL statement to solve the problem? That is in fact what we tried at first, but we ran into problems. The first problem is that there are 6 to 8 data sources, and the caliber of our market metrics is fine-tuned frequently. With a single job, every fine-tune requires changing that job, and the stability of a single large job is very poor.

2.2 Technical Solution

In response to the problems above, here is what we do. As shown in the example above, the first step is to perform minute-level deduplication on the three data sources A, B, and C by dimension and DID, which produces three minute-level deduplicated streams. They are then unioned together and the same logic is applied once more. This reduces the entry volume of our data sources from trillions to tens of billions. After deduplicating again at the day level, the data shrinks further from tens of billions to billions. With billions of records, associating the data with external services becomes feasible: we call the RPC interface of the user portrait service, and after the association we write the result to the target Topic. This target Topic is imported into the OLAP engine to serve multiple different services, including the mobile service, the large-screen service, and the indicator dashboard service.
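As an illustrative sketch only (the talk does not show the actual jobs), the per-source minute-level deduplication step might be expressed in Flink SQL roughly as below. The table and column names are hypothetical; the union and day-level deduplication would then consume the outputs of the per-source jobs.

```sql
-- Hedged sketch of the per-source minute-level deduplication, assuming a
-- hypothetical source table source_a(did, dim, event_time) with a watermark.
-- Each (dim, did) pair is emitted at most once per minute window, shrinking
-- the stream before the union and the day-level deduplication.
INSERT INTO dedup_minute_a          -- hypothetical sink topic/table
SELECT
    window_start,
    dim,
    did
FROM TABLE(
    TUMBLE(TABLE source_a, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, dim, did;
```

The deduplicated streams from the other sources would be produced the same way, then unioned and deduplicated once more at the day level before the user-portrait association described above.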
This solution has three advantages: stability, timeliness, and accuracy. The first is stability. Loose coupling means that when the logic of data source A and the logic of data source B need to be modified, they can be modified separately. The second point is task scalability: because we split all of the logic into very fine granularity, a problem in one place, such as a traffic issue, does not affect the subsequent parts, so scaling is relatively simple. In addition, there is service post-positioning and state controllability.

2.3 Delayed Calculation Scheme

How do we deal with the out-of-order situation described above? We have three solutions in total. The first solution is to deduplicate by "did + dimension + minute" and set the Value to "whether it has been seen". For example, for the same did, a message at 04:01 produces an output; likewise, messages at 04:02 and 04:04 also produce outputs. But if another message arrives at 04:01 it is discarded as a duplicate, whereas a late message at 04:00 still produces an output. With Solution 1, under the condition of tolerating 16 minutes of disorder, the state size of a single job is about 480 GB. Although this approach guarantees accuracy, the recovery time and stability of the job are completely uncontrollable, so we abandoned it.

3. Operational Scenarios

3.1 Background

The operational scenario can be divided into four parts. The first is data large-screen support, which includes analysis data for a single live broadcast room and analysis data for the overall market; it requires minute-level latency and has high update requirements. There is basically no difference between the first three types, except that in terms of query pattern some are for specific business scenarios and some are for general business scenarios. For the third and fourth types, the requirements for updates are relatively low, the requirements for throughput are relatively high, and the curves along the way do not require consistency. The fourth is more of a single-entity query pattern, such as querying a piece of content and its indicators, and it has relatively high QPS requirements.

3.2 Technical Solution

How do we handle the four different scenarios above? First, let's look at the basic detail layer (the left side of the figure). The data source has two links, one of which is the consumption stream, such as live-broadcast consumption information and views/likes/comments. After a round of basic cleaning, dimension management is performed. The upstream dimension information comes from Kafka: Kafka writes some content dimensions, which are put into KV storage, including some user dimensions.
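As a hedged sketch of the dimension association in this detail layer (the talk does not show the actual SQL), a Flink SQL lookup join against a KV-backed dimension table might look roughly like this. The connector, table, and column names are assumptions for illustration.

```sql
-- Hedged sketch: enrich the cleaned consumption stream with content
-- dimensions kept in external KV storage. Table and column names are
-- hypothetical; dim_content would be declared with whatever lookup
-- connector wraps the KV store, and proc_time is an assumed
-- processing-time attribute on the stream.
SELECT
    c.live_id,
    c.did,
    c.action,                       -- view / like / comment
    d.author_id,
    d.category
FROM dwd_consumption AS c           -- hypothetical cleaned detail stream
JOIN dim_content FOR SYSTEM_TIME AS OF c.proc_time AS d   -- lookup join
    ON c.content_id = d.content_id;
```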
4. Future Planning

The article above covered three scenarios: the first is standardized PV/UV calculation, the second is the overall solution for DAU, and the third is how we solve problems on the operational side. Based on this, we have some future plans, divided into four parts. The first part is to improve the real-time guarantee system: on the one hand we will support large-scale activities, including the Spring Festival Gala and subsequent regular activities, and we have a set of specifications for how to platformize the guarantees for these activities. The second part is to formulate graded guarantee standards, specifying which jobs get which guarantee level and standard, with a standardized description. The third part is to push forward engine and platform capabilities, including some engine capabilities for Flink tasks, and on the platform side to promote these norms and standards.