Model application in anti-fraud risk control of advertising traffic

1. Introduction to Ad Anti-Cheat

1.1 Definition of Ad Traffic Anti-Cheat

Advertising traffic cheating refers to media publishers using various fraudulent methods to illegitimately extract revenue from advertisers.

Fraudulent traffic mainly comes from:

  • Ad traffic from emulators or tampered devices;
  • Real devices whose traffic is driven by group-control software;
  • Real people on real devices, but induced into generating invalid traffic, etc.

1.2 Common cheating behaviors

  • Machine behavior: traffic generated by scripts, emulators, or group-controlled devices rather than genuine users;
  • Artificial behavior: interactive elements in creatives that induce clicks, media rendering copy that induces clicks, sudden pop-ups causing accidental clicks, etc.

1.3 Common types of cheating

Classified by stage of the advertising process:

  • Display fraud: the media exposes multiple display ads at the same time in the same ad slot and charges advertisers for each of them.
  • Click fraud: simulating real users with scripts or programs, or hiring and incentivizing users to click, generating large volumes of useless ad clicks to capture advertisers' CPC budgets.
  • Installation fraud: simulating downloads with test machines or emulators, modifying device information manually or by technical means, sending fake events through the SDK, etc.

2. Advertising Traffic Anti-Cheat Algorithm System

2.1 Application Background of Algorithm Models in Business Risk Control

Intelligent risk control uses a large amount of behavioral data to build models to identify and perceive risks. Compared with rule-based strategies, it significantly improves the accuracy, coverage and stability of identification.

Common unsupervised algorithms:

  • Density Clustering (DBSCAN)
  • Isolation Forest
  • K-means algorithm

Common supervised algorithms:

  • Logistic regression
  • Random forest

2.2 Advertising Traffic Model Algorithm System

The system is divided into four layers:

  • Platform layer: built mainly on the spark-ml/tensorflow/torch algorithm frameworks, combining open-source and custom-developed algorithms for business risk-control modeling.
  • Data layer: builds a portrait and feature system over the conversion funnel (request, exposure, click, download, activation, etc.) at multiple granularities such as vaid/ip/media/ad slot, to serve algorithm modeling.
  • Business model layer: based on behavioral features and portrait data, builds click anti-fraud audit models, request/click risk-estimation models, media behavior-similarity group models, and media-granularity anomaly-perception models.
  • Access layer: applies model outputs, merges offline click anti-fraud model audit results with policy-based audit results, and synchronizes penalties to downstream businesses; the media anomaly-perception model mainly produces candidate lists that are synchronized to the inspection platform for automated review.

3. Application Cases of Algorithm Models

3.1 Material Interaction Induces Perception

Background: adding a fake X close button to the ad creative causes users to click it when trying to close the ad, producing invalid click traffic and hurting the user experience. (The original figure showed the creative on the left and a heat map of user click coordinates on the right, with clicks concentrated around the fake X.)

Model recognition and perception:

1. Density Clustering (DBSCAN):

Let’s define a few concepts first:

  • ε-neighborhood: for any sample x and distance ε, the ε-neighborhood of x is the set of samples whose distance to x does not exceed ε;
  • Core object: if the ε-neighborhood of sample x contains at least minPts samples, then x is a core object;
  • Directly density-reachable: if sample b is in the ε-neighborhood of a, and a is a core object, then b is said to be directly density-reachable from a;
  • Density-reachable: for samples a and b, if there exists a sequence p1, p2, ..., pn with p1 = a and pn = b, where each sample is directly density-reachable from its predecessor, then b is density-reachable from a;
  • Density-connected: for samples a and b, if there exists a sample k such that both a and b are density-reachable from k, then a and b are density-connected;
  • Cluster: a cluster of the final clustering is a maximal set of density-connected samples derived from the density-reachability relation.



2. Applying the algorithm to perceive ads that induce accidental clicks:

① First, group the click data by resolution and ad position, and filter out groups that are too small;

② For each group, cluster with the density clustering algorithm, setting the neighborhood density threshold minPts = 10 and radius ε = 5;

③ For each group, after density clustering, screen for clusters whose area is abnormally small (dense click hotspots such as a fake close button). The specific training code is sketched below:
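A minimal sketch of steps ①–③ using scikit-learn's DBSCAN; the column names (resolution, slot_id, x, y), the group-size cutoff, and the cluster-area threshold are illustrative assumptions, not the production values:

```python
# Cluster click coordinates per (resolution, ad slot) group with DBSCAN,
# then keep dense clusters whose bounding area is abnormally small.
import pandas as pd
from sklearn.cluster import DBSCAN

def find_suspicious_clusters(clicks: pd.DataFrame, min_group_size=1000, max_area=400):
    suspicious = []
    for (resolution, slot_id), group in clicks.groupby(["resolution", "slot_id"]):
        if len(group) < min_group_size:            # ① drop groups that are too small
            continue
        coords = group[["x", "y"]].to_numpy()
        labels = DBSCAN(eps=5, min_samples=10).fit_predict(coords)  # ② ε=5, minPts=10
        group = group.assign(cluster=labels)
        for cid, cluster in group[group.cluster != -1].groupby("cluster"):
            # ③ bounding-box area of the cluster; tiny dense hotspots are suspicious
            area = (cluster.x.max() - cluster.x.min()) * (cluster.y.max() - cluster.y.min())
            if area < max_area:
                suspicious.append((resolution, slot_id, cid, len(cluster), area))
    return suspicious
```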

④ Effect monitoring and crackdown: for the mined clusters, join the post-click indicators, re-check ad slots with abnormal conversion metrics, and penalize the ad slots confirmed to have problems.

3.2 Click Anti-Cheat Model

3.2.1 Background

A cheating-click identification model is built for the click stage of the ad funnel to improve anti-cheating audit coverage, uncover high-dimensional hidden cheating behaviors, and effectively supplement the rule-based anti-cheating audit of click scenarios.

3.2.2 Construction Process


(1) Feature construction

For each click token, compute features at the device, IP, media, and ad-slot granularities, using only data from before the event occurs.

Frequency features: exposure, click, and installation behavior over the past 1 minute, 5 minutes, 30 minutes, 1 hour, 1 day, 7 days, etc., plus the corresponding mean, variance, dispersion, and other statistics;

Basic attribute features: media type, advertising type, device legitimacy, IP type, network type, device value level, etc.
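As one possible realization of the frequency features above, a sketch using Spark window aggregations; the table name ad_events and the columns vaid/event_time/event_type are assumptions:

```python
# Sliding-window frequency features per device via Spark SQL window functions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("click_features").getOrCreate()
events = spark.table("ad_events")  # assumed event log with a timestamp column

def count_window(seconds):
    # range window over the N seconds strictly before the current event
    return (Window.partitionBy("vaid")
                  .orderBy(F.col("event_time").cast("long"))
                  .rangeBetween(-seconds, -1))

features = events.select(
    "vaid", "event_time",
    F.count(F.when(F.col("event_type") == "click", 1)).over(count_window(300)).alias("clicks_5m"),
    F.count(F.when(F.col("event_type") == "click", 1)).over(count_window(3600)).alias("clicks_1h"),
    F.count(F.when(F.col("event_type") == "expose", 1)).over(count_window(3600)).alias("exposes_1h"),
)
```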

(2) Model training and results

① Sample selection:

  • Sample balancing: online cheating and non-cheating samples are imbalanced, so non-cheating samples are downsampled to a 1:1 ratio with cheating samples;
  • Robust sample selection: non-cheating samples are numerous, with diverse and unevenly distributed behavior patterns. So that the downsampled training set still covers all behavior patterns once the model is online, the K-means algorithm is used to group the non-cheating samples, and each group is then downsampled proportionally to obtain the non-cheating training samples (see the sketch below).
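A minimal sketch of the K-means stratified downsampling; the number of clusters and the function interface are illustrative assumptions:

```python
# Group non-cheating samples with K-means, then sample each group at the
# same fraction so every behavior pattern survives the downsampling.
import pandas as pd
from sklearn.cluster import KMeans

def stratified_downsample(negatives: pd.DataFrame, feature_cols, n_target,
                          n_clusters=20, seed=42):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    groups = km.fit_predict(negatives[feature_cols])
    frac = n_target / len(negatives)  # overall downsampling fraction
    return (negatives.assign(group=groups)
                     .groupby("group", group_keys=False)
                     .apply(lambda g: g.sample(frac=frac, random_state=seed)))
```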

② Feature preprocessing:

  • Missing-rate screening: count the missing rate of each feature and remove features with a missing rate above 50%;
  • Feature contribution screening: compute each feature's discrimination with respect to the predicted label Y and filter out features with a contribution below 0.001;
  • Feature stability screening: before the model goes online, take samples from the earliest and latest time periods, compute each feature's PSI (Population Stability Index) across the two periods, filter out features with a PSI above 0.2, and keep the more stable features (a sketch of the PSI calculation follows this list).
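A minimal sketch of the PSI computation used for this screening, where PSI = Σᵢ (pᵢ − qᵢ)·ln(pᵢ/qᵢ) over shared bins; the binning scheme is an assumption:

```python
# Population Stability Index between a baseline and a comparison sample
# of one feature; bins are derived from the baseline distribution.
import numpy as np

def psi(baseline, comparison, bins=10, eps=1e-6):
    edges = np.histogram_bin_edges(baseline, bins=bins)
    p, _ = np.histogram(baseline, bins=edges)
    q, _ = np.histogram(comparison, bins=edges)
    p = p / p.sum() + eps  # eps avoids log(0) and division by zero
    q = q / q.sum() + eps
    return float(np.sum((p - q) * np.log(p / q)))
```

The same calculation serves the post-launch stability monitoring described in step ④ below.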

③ Model training:

The random forest algorithm is used to classify click fraud behaviors. Random forest has many advantages, such as:

(1) It can handle very high-dimensional data without feature selection;

(2) Out-of-bag samples give an unbiased estimate of the generalization error, so the model generalizes well;

(3) Training is fast and easy to parallelize (trees are built independently of each other);

(4) Strong resistance to overfitting.

Hyperparameter search: use ParamGridBuilder to configure maxDepth (maximum tree depth), numTrees (number of trees), and other hyperparameters, and search for the optimal combination.
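A minimal Spark ML sketch of this grid search; the feature/label column names and train_df are assumed to come from the preceding steps:

```python
# Random-forest training with a hyperparameter grid and cross-validation.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 10, 15])      # maximum tree depth
        .addGrid(rf.numTrees, [50, 100, 200])   # number of trees
        .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(train_df)  # train_df: the balanced, screened training set
```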

④ Model stability monitoring:

After the model is launched, feature distributions may drift over time, so the distribution at inference time diverges from that at training time; model stability must therefore be monitored and the model updated iteratively.

First, archive the training samples of the current version; then compute the PSI (Population Stability Index) of each feature between inference-time data and training-time data, and push the values to the visualization platform for daily monitoring and alerting.

⑤ Model interpretability monitoring:

After the model is launched, interpretability monitoring is applied to inference data so that the cause of a risk hit can be located intuitively: for each record, the impact of each feature on the predicted label is computed.

SHAP (SHapley Additive exPlanations) values are used to explain how features affect the model output; the Shapley values are computed and pushed to the visualization platform for daily operational analysis.
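A minimal sketch using the open-source shap library, assuming a fitted tree-ensemble classifier `model` and a feature DataFrame `X` from the steps above:

```python
# Per-record feature attribution for a tree model via Shapley values.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one attribution per feature per record
# Push shap_values to the visualization platform; for local inspection, e.g.:
shap.summary_plot(shap_values, X)
```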

3.3 Click sequence anomaly detection

3.3.1 Background

From each user's hourly click-volume sequence, we can mine devices exhibiting malicious behavior and detect abnormal user groups whose patterns deviate far from the normal majority: for example, groups that click only at low frequency between 0:00 and 6:00 a.m. and not at all at other times, or users whose clicks are spread evenly across every hour.

3.3.2 Construction Process

(1) Feature construction

Taking the device as the user unit, the number of clicks per hour over the past 1/7/30 days is counted to form 1×24-, 7×24-, and 30×24-hour click sequences. These features are complete on the time scale and continuous, which makes them suitable for anomaly detection algorithms.
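A minimal sketch of the sequence construction, assuming a pandas click log with vaid and click_time columns (names are illustrative):

```python
# Build a days*24-hour click-count vector per device from a raw click log.
import pandas as pd

def hourly_click_sequences(clicks: pd.DataFrame, days=7):
    clicks = clicks.copy()
    clicks["hour"] = clicks["click_time"].dt.floor("h")
    counts = clicks.groupby(["vaid", "hour"]).size().unstack(fill_value=0)
    # Reindex to a complete hourly grid so every device gets a fixed-length,
    # gap-free vector (missing hours become zero clicks).
    end = clicks["hour"].max()
    full_range = pd.date_range(end=end, periods=days * 24, freq="h")
    return counts.reindex(columns=full_range, fill_value=0)
```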

(2) Model selection

The isolation forest outlier detection algorithm rests on two assumptions: anomalous data make up a very small proportion of the total sample, and the feature values of anomalies differ greatly from those of normal points.

It detects points that are sparsely distributed and far from the dense majority. In the classic illustration (original figure omitted), the relatively anomalous point Xo needs only 4 cuts to be separated from the rest, i.e. "isolated", while the more normal point Xi is separated only after 11 cuts.

(3) Model training

The IsolationForest algorithm is used to run anomaly detection training on traffic at several granularities for better coverage; a minimal sketch follows the three settings below.

① For traffic on the entire platform, train the anomaly-perception model with the abnormal sample ratio contamination=0.05;

② For each type of media traffic, train an anomaly-perception model with contamination=0.1;

③ For each type of ad-slot traffic, train the anomaly-perception model with contamination=0.1.
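A minimal sketch of the three training settings with scikit-learn's IsolationForest, assuming feature matrices X_platform, X_media, and X_slot built from the click sequences above:

```python
# One isolation forest per traffic granularity, with the contamination
# ratios given in steps ①-③.
from sklearn.ensemble import IsolationForest

platform_model = IsolationForest(contamination=0.05, random_state=42).fit(X_platform)
media_model = IsolationForest(contamination=0.10, random_state=42).fit(X_media)
slot_model = IsolationForest(contamination=0.10, random_state=42).fit(X_slot)

# score_samples returns higher = more normal; negate (and rescale if needed)
# to obtain a "closer to 1 is more anomalous" score for screening.
scores = -platform_model.score_samples(X_platform)
```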

(4) Perception monitoring

  • Anomaly score definition: the isolation forest anomaly score is s(x, n) = 2^(−E(h(x)) / c(n)), where E(h(x)) is the average path length of sample x across the trees and c(n) is the average path length of an unsuccessful search in a binary search tree of n samples. A score close to 1 indicates an anomaly; a score well below 0.5 indicates a normal point.
  • Anomaly screening: users with anomaly scores above 0.7 are screened as high risk, and those with scores between 0.5 and 0.7 as medium risk; the medium- and high-risk groups are synchronized to the audit platform for manual secondary review;
  • Case Study:

Case ①

On XX/XX/2022, 7×24-hour click-volume anomaly detection flagged a suspected malicious user A who, for most hours of the past 7 days, generated far more clicks per hour on average than normal users.

(Note: each point in the feature sequence represents one user's click count in a single hour)

Case ②

On XX/XX/2022, 1×24-hour click-sequence anomaly detection found a suspected malicious user B who generated clicks almost exclusively in the early morning and almost none during the rest of the day.


4. Conclusion

In traffic anti-fraud, as adversarial techniques escalate, algorithm models are increasingly effective at discovering the hidden fraud patterns of black-market operations. In advertising traffic anti-fraud, we apply supervised and unsupervised models to fraudulent-traffic identification and abnormal-traffic perception, effectively improving recognition ability and mining more complex abnormal behavior patterns. Going forward, we will explore further practical applications of algorithm models in machine-traffic identification.
