PV (Page View): the number of page views or clicks; a new view is counted every time the user loads or refreshes a page. UV (Unique Visitor): a client that visits the website counts as one visitor, and the same client is counted only once between 00:00 and 24:00. Computing the real-time PV and UV of a website or app is a very common statistical requirement. Here we present a general method that can be reused with only minor modifications for different business needs: using Flink to compute, in real time, the PV and UV from midnight up to the current moment.

1. Demand Analysis

The data sent from Kafka contains a timestamp, a time string, a dimension, and a user id. We need to count PV and UV from 00:00 up to the current time, broken down by dimension, and start counting from zero again at 00:00 the next day.

2. Technical Solution

- Kafka data may arrive late and out of order, so a watermark is introduced;
- the data is split with keyBy into per-dimension tumbling windows, and PV and UV are computed within each window;
- since a full day's state must be kept, ValueState is used inside the process function to hold PV and UV;
- the UV ValueState holds a bitmap, which takes very little memory; a dependency on a bitmap library is introduced;
- the saved state needs a TTL, so that the first day's state expires on the second day and memory usage stays bounded.

3. Data Preparation

Here we assume it is user order data, with the following format:
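The sample record itself is not shown in the original; a hypothetical record with the four fields above (field names are assumptions, reused in the code sketches below) might look like this:

```json
{"ts": 1586855526, "time": "2020-04-14 16:32:06", "type": "app", "user_id": "4608"}
```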
4. Code Implementation

A screenshot of the overall project is shown below (information not suitable for public disclosure has been blurred out): pvuv-project

1. Environment

Kafka: 1.0.0; Flink: 1.11.0

2. Send test data

First, send data to the Kafka test cluster. Maven dependencies:
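The dependency block is elided in the original; at minimum it presumably contains the Kafka client matching the test cluster (the version here is an assumption):

```xml
<!-- assumed: producer client matching the Kafka 1.0.0 test cluster -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
```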
Send code:
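The sending code is also elided; a minimal sketch that produces JSON records like the hypothetical one above (broker address, topic name, and message layout are assumptions):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SendDataToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: test cluster address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        String[] dims = {"app", "web"}; // assumption: two example dimensions
        for (int i = 0; i < 1000; i++) {
            long ts = System.currentTimeMillis() / 1000; // timestamp in seconds, as noted later
            String msg = String.format(
                    "{\"ts\": %d, \"time\": \"\", \"type\": \"%s\", \"user_id\": \"%d\"}",
                    ts, dims[i % dims.length], i % 100);
            producer.send(new ProducerRecord<>("user_clicks", msg)); // assumption: topic name
            Thread.sleep(100);
        }
        producer.close();
    }
}
```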
3. Main program

First define a POJO:
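The POJO is elided in the original; a minimal version matching the four fields from the demand analysis (class and field names are assumptions carried through the sketches below):

```java
public class UserClick {
    private Long ts;       // event timestamp in seconds
    private String time;   // human-readable time string
    private String type;   // dimension, e.g. "app" or "web"
    private String userId; // numeric user id, kept as a string

    public UserClick() {} // Flink POJOs need a public no-arg constructor

    public Long getTs() { return ts; }
    public void setTs(Long ts) { this.ts = ts; }
    public String getTime() { return time; }
    public void setTime(String time) { this.time = time; }
    public String getType() { return type; }
    public void setType(String type) { this.type = type; }
    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }
}
```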
The next step is to consume Kafka with Flink, assign the watermark, split the data with keyBy, and enter the tumbling window function, where PV and UV are kept in state; a sketch of the pipeline follows.
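The full main program is not reproduced here; this is a minimal sketch under the assumptions above (fastjson for parsing, the topic and group id, and the PurgingTrigger wrapper that clears the window buffer on each fire are all choices of this sketch, not necessarily the author's):

```java
import com.alibaba.fastjson.JSON;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.time.Duration;
import java.util.Properties;

public class PvUvJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumption
        props.setProperty("group.id", "pvuv-consumer");           // assumption

        env.addSource(new FlinkKafkaConsumer<>("user_clicks", new SimpleStringSchema(), props))
           .map(PvUvJob::parse)
           // tolerate 10 s of out-of-order data; ts is in seconds, Flink needs milliseconds
           .assignTimestampsAndWatermarks(
               WatermarkStrategy.<UserClick>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                   .withTimestampAssigner((click, ts) -> click.getTs() * 1000L))
           .keyBy(UserClick::getType)
           // one-day tumbling window aligned to Beijing time (UTC+8)
           .window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
           // fire every 10 s; PurgingTrigger clears the window buffer on each fire,
           // so running totals live in keyed state rather than in buffered events
           .trigger(PurgingTrigger.of(ContinuousEventTimeTrigger.of(Time.seconds(10))))
           .process(new PvUvProcessFunction()) // see section 4 below
           .print();                           // replace with a MySQL sink in production

        env.execute("pv-uv");
    }

    private static UserClick parse(String json) {
        return JSON.parseObject(json, UserClick.class); // assumption: fastjson on the classpath
    }
}
```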
The code is fairly standard, but a few things deserve attention.

Notice:

- Setting the watermark: Flink 1.11 uses WatermarkStrategy; the old API has been deprecated. The timestamp in my data is in seconds, so it must be multiplied by 1000, because the time field Flink extracts has to be in milliseconds.
- .window takes a single WindowAssigner, here a tumbling window: TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)). The window size is one day, and since Beijing time is UTC+8, an offset of -8 hours is needed so the window starts at local midnight. The English comment in the source of this method explains it: "Rather than that, if you are living in somewhere which is not using UTC±00:00 time, such as China which is using UTC+08:00, and you want a time window with size of one day, and window begins at every 00:00:00 of local time, you may use of(Time.days(1), Time.hours(-8)). The parameter of offset is Time.hours(-8) since UTC+08:00 is 8 hours earlier than UTC time."
- Under the watermark mechanism, a one-day window would fire only once a day, which is obviously unreasonable for our use case. We therefore use a trigger with a 10 s interval, so our PV and UV are refreshed every 10 s.

4. Key code: calculating UV

Since the user id here is just a number, a bitmap can be used for deduplication. The principle is simple: use user_id as the bit offset and set that bit to 1 to mark a visit; about 1 MB of space is enough to record the daily visits of more than 8 million users. Redis has a built-in bit data structure, but to depend as little as possible on external storage, we implement the bitmap ourselves here and add the corresponding Maven dependency:
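The dependency is elided in the original; assuming the widely used RoaringBitmap library (the version below is a placeholder, pick a current release):

```xml
<!-- assumed: RoaringBitmap, a compressed bitmap implementation for the JVM -->
<dependency>
    <groupId>org.roaringbitmap</groupId>
    <artifactId>RoaringBitmap</artifactId>
    <version>0.9.0</version>
</dependency>
```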
The code for calculating PV and UV is fairly generic and can be quickly adapted to your actual business:
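The window function is elided in the original; a minimal sketch consistent with the pipeline above (keyed state per dimension, a Roaring64NavigableMap as the bitmap since the user ids are numeric, and a one-day TTL as described):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.roaringbitmap.longlong.Roaring64NavigableMap;

public class PvUvProcessFunction
        extends ProcessWindowFunction<UserClick, Tuple3<String, Long, Long>, String, TimeWindow> {

    private transient ValueState<Long> pvState;
    private transient ValueState<Roaring64NavigableMap> uvState;

    @Override
    public void open(Configuration parameters) {
        // expire state one day after the last write, so yesterday's counters are released
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(org.apache.flink.api.common.time.Time.days(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Long> pvDesc = new ValueStateDescriptor<>("pv", Long.class);
        ValueStateDescriptor<Roaring64NavigableMap> uvDesc = new ValueStateDescriptor<>(
                "uv-bitmap", TypeInformation.of(Roaring64NavigableMap.class));
        pvDesc.enableTimeToLive(ttl);
        uvDesc.enableTimeToLive(ttl);

        pvState = getRuntimeContext().getState(pvDesc);
        uvState = getRuntimeContext().getState(uvDesc);
    }

    @Override
    public void process(String dimension, Context context, Iterable<UserClick> elements,
                        Collector<Tuple3<String, Long, Long>> out) throws Exception {
        Long pv = pvState.value();
        Roaring64NavigableMap bitmap = uvState.value();
        if (pv == null) pv = 0L;
        if (bitmap == null) bitmap = new Roaring64NavigableMap();

        // the purging trigger delivers only the events since the last fire, so pv can
        // simply be incremented; the bitmap deduplicates user ids, giving uv for free
        for (UserClick click : elements) {
            pv++;
            bitmap.addLong(Long.parseLong(click.getUserId()));
        }

        pvState.update(pv);
        uvState.update(bitmap);
        out.collect(Tuple3.of(dimension, pv, bitmap.getLongCardinality()));
    }
}
```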
Notice:

- Since the first day's data is no longer needed when calculating the second day's UV, the previous day's state must be cleared from memory in time; here that is done through the TTL mechanism.
- The final results are saved in MySQL. If there are many result categories and aggregations, watch the write pressure on MySQL and optimize as needed.

5. Other methods

Besides deduplicating with a bitmap, you can also use Flink SQL, which is simpler to code, or use an external store such as Redis for deduplication:
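The SQL is elided in the original; a sketch of the idea, assuming a user_clicks table declared over the Kafka topic (the availability of functions such as FROM_UNIXTIME depends on the Flink version; COUNT(DISTINCT ...) keeps the dedup state for UV, and the result is a continuously updating retract stream):

```sql
-- assumed table: user_clicks(ts BIGINT, time_str STRING, type STRING, user_id STRING)
SELECT
  FROM_UNIXTIME(ts, 'yyyy-MM-dd') AS dt,   -- group by calendar day
  type                            AS dim,
  COUNT(*)                        AS pv,
  COUNT(DISTINCT user_id)         AS uv
FROM user_clicks
GROUP BY FROM_UNIXTIME(ts, 'yyyy-MM-dd'), type;
```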
The specific idea is to compute PV and UV in Redis, then read the values back and persist the statistical results, which is also a commonly used approach; a sketch follows.
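A minimal sketch of that idea with Jedis, using Redis's built-in bit structure mentioned above (the key layout and the two-day expiry are assumptions; in a real job this would sit inside a RichSinkFunction or process function):

```java
import redis.clients.jedis.Jedis;

public class RedisPvUv {
    // record one click: increment pv, and set the user's bit in a per-day bitmap for uv
    // keys are per day and per dimension, e.g. "pv:2020-04-14:app"
    public static void record(Jedis jedis, String day, String dim, long userId) {
        String pvKey = "pv:" + day + ":" + dim;
        String uvKey = "uv:" + day + ":" + dim;
        jedis.incr(pvKey);
        jedis.setbit(uvKey, userId, true);   // user_id as the bit offset, as in section 4
        jedis.expire(pvKey, 2 * 24 * 3600);  // let old keys expire on their own
        jedis.expire(uvKey, 2 * 24 * 3600);
    }

    // read the current totals back, e.g. to persist them into MySQL
    public static long[] read(Jedis jedis, String day, String dim) {
        String pv = jedis.get("pv:" + day + ":" + dim);
        long uv = jedis.bitcount("uv:" + day + ":" + dim);
        return new long[]{pv == null ? 0L : Long.parseLong(pv), uv};
    }
}
```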