1. Introduction to Ad Anti-Cheat

1.1 Definition of Ad Traffic Anti-Cheat

Advertising traffic cheating refers to media using various cheating methods to gain benefits from advertisers. Fraudulent traffic mainly comes from:
1.2 Common Cheating Behaviors

1.3 Common Types of Cheating

According to the order of the advertising process:
2. Advertising Traffic Anti-Cheat Algorithm System

2.1 Application Background of Algorithm Models in Business Risk Control

Intelligent risk control builds models on large volumes of behavioral data to identify and perceive risks. Compared with rule-based strategies, it significantly improves the accuracy, coverage, and stability of identification. Common unsupervised algorithms:
Common supervised algorithms:
2.2 Advertising Traffic Model Algorithm System

The system is divided into four layers:
3. Application Cases of Algorithm Models

3.1 Detecting Creatives That Induce Accidental Clicks

Background: Some ad creatives include a fake "X" close button drawn into the image. Users who try to close the ad tap the fake button instead, which produces invalid click traffic and degrades the user experience. The left picture shows the original creative, and the right picture shows a heat map of the coordinates of user clicks; the clicks concentrate on the fake X.

Model recognition and perception:

1. Density clustering (DBSCAN): Let's define a few concepts first:
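For reference, in standard DBSCAN the ε-neighborhood of a point is the set of points within distance ε of it; a core point is one whose ε-neighborhood contains at least a minimum number of points; a cluster is the set of points reachable from a core point through a chain of core points; and points that belong to no cluster are noise. The sketch below illustrates these ideas on click coordinates. It is only an illustration under assumptions, not the production training code: it assumes scikit-learn's `DBSCAN`, synthetic click data, and a hypothetical helper `find_click_hotspots` whose bounding-box area filter and `max_area` threshold are choices made here. The values `eps=5` and `min_samples=10` mirror the radius and neighborhood density threshold given in the procedure that follows.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def find_click_hotspots(points, eps=5.0, min_samples=10, max_area=400.0):
    """Cluster click coordinates and keep only compact, dense clusters.

    points: array of shape (n, 2) holding click (x, y) pixels for ONE
    resolution / ad-slot group. max_area is a hypothetical threshold on the
    bounding-box area (px^2); it is an assumption of this sketch.
    """
    points = np.asarray(points, dtype=float)
    # min_samples counts the point itself, following scikit-learn's convention.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    hotspots = []
    for label in np.unique(labels):
        if label == -1:  # -1 marks noise points that belong to no cluster
            continue
        cluster = points[labels == label]
        width, height = cluster.max(axis=0) - cluster.min(axis=0)
        if width * height <= max_area:
            hotspots.append({
                "center": cluster.mean(axis=0),
                "size": len(cluster),
                "area": float(width * height),
            })
    return hotspots


# Synthetic example: sparse background clicks plus a tight blob near a fake "X".
rng = np.random.default_rng(0)
background = rng.uniform(0, 1000, size=(40, 2))
fake_x = rng.normal(loc=(950, 30), scale=2.0, size=(200, 2))
hotspots = find_click_hotspots(np.vstack([background, fake_x]))
```

In this synthetic data the scattered background clicks are too sparse to form a cluster, so the only hotspot reported is the tight blob around the assumed fake-button position. In practice the function would be called once per resolution and ad-slot group.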
2. Apply the algorithm to detect ads that induce accidental clicks:

① Group the click data by screen resolution and ad slot, and discard the smaller groups;
② For each group, run density clustering with the neighborhood density threshold set to 10 and the radius ε=5;
③ For each group, after density clustering, select the clusters that cover a small area. The specific training code is as follows:
④ Effect monitoring and enforcement: for the mined clusters, join the post-click metrics, re-check the ad slots with abnormal conversion metrics, and take action against the ad slots found to be problematic in the re-check.

3.2 Click Anti-Cheat Model

3.2.1 Background

A cheating-click identification model is built for the click stage of the ad pipeline. It extends anti-cheat audit coverage, uncovers cheating behaviors hidden in high-dimensional data, and complements the rule-based anti-cheat audit of click scenarios.

3.2.2 Construction Process

(1) Feature construction

At token granularity, compute features for the device, IP address, media, and ad slot using data from before the event occurs.

Frequency features: exposure, click, and installation behavior over the past 1 minute, 5 minutes, 30 minutes, 1 hour, 1 day, 7 days, etc., together with the corresponding statistics such as mean, variance, and dispersion;
Basic attribute features: media type, advertising type, device legitimacy, IP type, network type, device value level, etc.

(2) Model training and results

① Sample selection:
② Feature preprocessing:
③ Model training: A random forest is used to classify click-fraud behavior. Random forests have several advantages:
(1) They can handle very high-dimensional data without feature selection;
(2) They provide an unbiased estimate of the generalization error, and the model generalizes well;
(3) Training is fast and easy to parallelize (trees are trained independently of each other);
(4) They are resistant to overfitting.
Hyperparameter search: use ParamGridBuilder to define a grid over max_depth (maximum tree depth), numTrees (number of trees), and other hyperparameters, and search for the best combination.

④ Model stability monitoring: After the model is launched, feature distributions may drift over time, so the distribution at inference time diverges from the distribution at training time. Model stability therefore needs to be monitored and the model updated iteratively. First, archive the training samples of the current version. Then compute the PSI (Population Stability Index) of each feature between the inference-time data and the training-time data, and visualize the PSI values for daily monitoring and alerting.

⑤ Model interpretability monitoring: After the model is launched, the inference data is monitored for interpretability so that the reason a record was flagged as risky can be located more intuitively. That is, for each record, the contribution of each feature to the predicted label is computed. SHAP (SHapley Additive exPlanations) values are used to explain how features affect the model output; they are computed and sent to the visualization platform for daily operational analysis.
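The PSI check in step ④ can be sketched as follows. This is an illustration under assumptions, not the monitoring code used in production: it uses NumPy, bins each feature by training-set deciles, and computes PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the fractions of training and inference samples in bin i. The function name `population_stability_index`, the binning scheme, and the `n_bins` and `eps` parameters are all choices made for this example.

```python
import numpy as np


def population_stability_index(train_values, infer_values, n_bins=10, eps=1e-6):
    """PSI of one numeric feature between training and inference samples."""
    train_values = np.asarray(train_values, dtype=float)
    infer_values = np.asarray(infer_values, dtype=float)
    # Bin edges are training-set quantiles, so for a continuous feature each
    # bin holds roughly 1/n_bins of the training data.
    edges = np.quantile(train_values, np.linspace(0.0, 1.0, n_bins + 1))
    edges = np.unique(edges)  # drop duplicate edges for low-cardinality features
    # Only interior edges are used, so values outside the training range
    # fall into the first or last bin.
    interior = edges[1:-1]
    train_bins = np.searchsorted(interior, train_values, side="right")
    infer_bins = np.searchsorted(interior, infer_values, side="right")
    n = len(interior) + 1
    expected = np.bincount(train_bins, minlength=n) / len(train_values)
    actual = np.bincount(infer_bins, minlength=n) / len(infer_values)
    # Clip empty bins to a small value to avoid log(0) and division by zero.
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))


# Synthetic check: one unchanged feature and one whose mean has shifted.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
same = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(1.0, 1.0, 10_000)
psi_same = population_stability_index(train, same)
psi_shifted = population_stability_index(train, shifted)
```

Samples drawn from the same distribution give a PSI close to 0, and the value grows as the inference distribution moves away from the training one. Alert thresholds are an operational choice; values around 0.1 and 0.25 are commonly quoted rules of thumb for moderate and significant shift.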
3.3 Click Sequence Anomaly Detection

3.3.1 Background

A user's hourly click-volume sequence can be used to find the devices behind malicious behavior and to detect groups of users whose patterns lie far from the normal majority. Examples include groups that produce only low-frequency clicks between 0:00 a.m. and 6:00 a.m. and none at other times, or users whose click volume is evenly balanced across every hour.

3.3.2 Construction Process

(1) Feature construction

Treating each device as a user, count the number of clicks per hour over the past 1/7/30 days to form 1*24-hour, 7*24-hour, and 30*24-hour click sequences. The constructed features are complete along the time axis and continuous in value, which suits anomaly detection algorithms.

(2) Model selection

The isolation forest outlier detection algorithm is used. It rests on two assumptions: abnormal data makes up a very small proportion of the total sample, and the feature values of abnormal points differ markedly from those of normal points. It detects points that are sparsely distributed and far from densely populated regions. For example, as can be seen in the figure below, the more abnormal point Xo needs only 4 cuts to be separated from the rest, that is, to be "isolated", while the more normal point Xi is separated only after 11 cuts.

(3) Model training

The IsolationForest algorithm is used to train anomaly detection on traffic at several granularities for better coverage.
① For traffic across the entire platform, train an anomaly perception model with the abnormal sample ratio contamination=0.05;
② For each type of media traffic, train an anomaly perception model with the abnormal sample ratio contamination=0.1;
③ For each type of ad slot traffic, train an anomaly perception model with the abnormal sample ratio contamination=0.1.

(4) Perception monitoring
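As an illustration of the training in step (3) and of the kind of flagging this monitoring relies on, here is a minimal sketch assuming scikit-learn's `IsolationForest` and synthetic 1*24-hour click sequences. The data, the Poisson click rates, the two hand-made suspicious devices, and `n_estimators=200` are assumptions made for the example; only the platform-wide abnormal sample ratio of 0.05 comes from the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic 1*24-hour click sequences: one row per device, one column per hour.
# Assumed normal behavior: a few clicks per hour from 8:00 to 22:59, almost none at night.
hourly_rate = np.full(24, 0.2)
hourly_rate[8:23] = 3.0
normal = rng.poisson(hourly_rate, size=(1000, 24))

# Two hypothetical suspicious devices mirroring the patterns described earlier.
night_only = np.zeros(24, dtype=int)
night_only[0:6] = 8               # clicks only between 0:00 and 6:00
uniform_heavy = np.full(24, 40)   # the same high click volume in every hour

X = np.vstack([normal, night_only, uniform_heavy])

# contamination is the expected share of anomalies (the abnormal sample ratio).
model = IsolationForest(n_estimators=200, contamination=0.05, random_state=0)
labels = model.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # lower = more anomalous

flagged = np.flatnonzero(labels == -1)  # row indices of devices to review
```

Here `labels` marks roughly 5% of the devices as anomalous, and both hand-made devices fall in that set because their hourly patterns are isolated after very few splits. Per-media or per-ad-slot models would repeat the same call on the corresponding subset of traffic with the ratio set to 0.1.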
Case ①: On XX/XX/2022, anomaly detection on the 7*24-hour click sequence flagged suspected malicious user A, who during most of the past 7 days generated far more clicks per hour on average than normal users. (Note: each point in features represents the number of clicks by a user in one hour.)

Case ②: On XX/XX/2022, anomaly detection on the 1*24-hour click sequence flagged suspected malicious user B, who generated clicks almost exclusively in the early morning and almost none during the rest of the day.

4. Conclusion

In traffic anti-fraud, as the contest between attackers and defenders escalates, algorithm models are better able to uncover the hidden fraud patterns of black-market operators. In advertising traffic anti-fraud, we have applied supervised and unsupervised models to identify fraudulent traffic and to perceive abnormal traffic, which has effectively improved recognition capability and surfaced more complex abnormal behavior patterns. In the future, we will explore more practical applications of algorithm models in machine traffic identification.