Alibaba Engineers Have Worked out Loads of Methods to Detect Anomalies

By Li Weibin, Hu Yi, and Wang Hao.

The dark web and black markets are going larger and larger nowadays, and new methods for cheating the system are constantly popping up. These issues can make advertising less effective or even lead to sharp increases in app promotion costs. Therefore, as you can imagine, the precise identification of these issues, specifically these cheating methods, is an important expectation of any Internet company or advertiser.

In this blog, we will be exploring the issues of anomaly detection from multiple angles and see how Alibaba engineers have worked out several different methods for detecting anomalies. Some of the angles that the Alibaba engineers worked out are time series data, statistics, distance data, and the linear methods. In this blog, we will look into these angles as well as many, many others.

But, before we get to into it, let’s first discuss some background information.

Outlier detection is a process of detecting objects whose behavior are significantly different from that of the expected objects. These detected objects are called outliers. Outlier detection is widely used in production and life, such as anti-fraud for credit cards, industrial damage detection, and advertisement anti-cheating.

An outlier is a data object that is distinctly different to other data objects. As shown in Figure 1 below, the points in the N1 and N2 regions are normal data sets. However, points in O1, O2 and O3 regions do not fit into the normal patterns that you’ll find with N1 and N2, and so they are outliers.

Figure 1. Example of outliers

One of the difficulties in anomaly detection is the lack of ground truth, which just simply means information that is not inferred. A common method to fix this is to use unsupervised methods to mine abnormal samples, and then use supervised models to fuse multiple features to mine more cheating.

Recently, multiple algorithms have been used to mine outliers. The following sections look into the principle and application scenarios of the anomaly detection algorithm from several different angles. Due to service specificity, of course, this article does not cover all feature details.

1. Time Series

1.1 Moving Average (MA)

1.1.1 Simple Moving Average (SMA)

It is easy to see from the above formula that the average of the historical value can be used as the predicted value of the current value. In the scenario where the sequence value fluctuates less with time, if the difference between the above moving average and the real value at that time exceeds a certain threshold, the value at that time is determined to be abnormal.

Applicable to:

  1. Smooth the noise data, that is, replacing the current time value with the moving average to filter noise.
  2. Predict future values.

1.1.2 Weighted Moving Average (WMA)

The WMA is more sensitive to recent changes than the SMA. The WMA has less lag than the SMA. However, the WMA still has a certain lag because the weight only decreases linearly.

1.1.3 Exponential Moving Average (EMA)

indicates the degree of weight decrease. The value ranges from 0 to 1.

The larger the value, the faster the observed values in the past decrease.

1.2 Year-on-Year Ratio and Month-over-Month Ratio

Figure 2. Year-on-Year ratio and Month-over-Month ratio

The Year-on-Year and Month-over-Month formulas are shown in Figure 2. They are suitable for use in scenarios with periodic data. For example, it can be used for monitoring the Year-on-Year and Month-over-Month ratio for daily active users (DAU) of the app, to detect whether the number of daily active users (DAU) is rising or falling during a given period of time. As a second example, it can also be used for monitoring the real-time Year-over-Year and Month-over-Month ratio for clicks and consumption of advertisements, to detect changes over time. When the above ratio exceeds a certain threshold (for information on the threshold, please see section 10), it is determined that an anomaly occurs.

1.3 Anomaly Detection of Time Series Metrics (STL + GESD)

  1. To decompose the metrics into STL time series first, to obtain the Seasonal, Trend, and Residual components, as shown in Figure 3.
  2. To use the GESD (which stands for generalized extreme studentized deviate) algorithm to detect the anomaly of the Trend and Residual components.
  3. To replace the mean and std statistics in the GESD algorithm with median and MAD (median absolute deviation), to improve the robustness against the outlier.
  4. To output the anomaly score: abnorm_score = (value — median)/MAD, where the “value” is the current value, and “median” is the median of the sequence. A negative score indicates a decrease in anomalies, while a positive score indicates an increase in anomalies.
Figure 3. STL decomposition example

2. Statistics

2.1 Single Feature, Which Conform to Gaussian Distribution

, then its probability density function is:

We can use the existing sample data

to predict the

in the population. The computation method is as follows:

2.2 Multiple Unrelated Features, All of Which Conform to Gaussian Distribution

And each variable conforms to Gaussian distribution, then the average and variance of each dimension can be computed.

Specifically, for

, the following can be computed:

If a new piece of data

is provided, the probability

can be computed as follows:

2.3 Multiple Related Features, Which Conform to Multivariate Gaussian Distribution

, and each variable conforms to Gaussian distribution, then the covariance matrix of the n-dimensional average vector

can be computed:

If a new piece of data

is provided, the probability can be computed:

2.4 Mahalanobis Distance

is an average vector, then for any object

in the dataset D, the Mahalanobis distance from



is the covariance matrix. The value

can be sorted. If the value is too large, then the point can be considered as an outlier.

2.5 Boxplot

in normal distribution). As shown in Figure 4 below:

Figure 4. Schematic diagram of boxplot algorithm

3. Distance

3.1 Angle-Based Outlier Detection

Figure 5. Point set and angle

As shown in Figure 5 above, three points X, Y, and Z, and two vectors

exist now. If the change of angle

is small for any different point, such as Y and Z, then point X is an outlier. The angle is easily obtained using the cosine-angle formula:

If D is a point set, then for any different point

, the variance of all angles of point X is:

The above variance of the outlier is small. The time complexity of the algorithm is

, which is suitable for scenarios with a small data volume N.

3.2 KNN-Based Outlier Detection

, the sum of distances Dist(K,X) of its K nearest neighbors is computed. The greater the Dist(K,X), the more abnormal the point. The time complexity is

, where N is the size of the data amount.

4. Linear Method (Matrix Decomposition and PCA Dimension Reduction)

Assume that X is a p-dimensional dataset with N samples, and its covariance matrix is

. Then, the covariance matrix can be decomposed as

P is a

-dimensional orthogonal matrix, , and each of its columns

is an eigenvector. D is a

-dimensional diagonal matrix, which contains eigenvalues

. On a graph, an eigenvector can be regarded as a line in a 2-dimensional plane, or a plane in a higher dimensional space. The eigenvalues corresponding to the eigenvectors reflect the stretching degree of the batch of data in this direction. Generally, the eigenvalues in the eigenvalue matrix D are sorted from large to small, and each column of the eigenvector matrix P is adjusted accordingly.

The projection of dataset X on the principal component space can be written as Y = XP. Note that the projection can be performed only on some dimensions. The matrix after the principal component projection of top-j is:

is the first j column(s) of matrix p, that is,

is a

-dimensional matrix.

is the first j column(s) of matrix Y and is a

-dimensional matrix. In the same way, the data is projected from the principal component space to the original space. The reconstructed dataset is

The data set restructured by using the principal component of top-j is a dimension matrix. The matrix is shown in Figure 6.

Figure 6. Matrix transformation diagram

The outlier of

is defined as:


represents the ratio of the principal component of top-j to all the principal components. The eigenvalues are sorted in descending order. Therefore,

is incremental, which means the larger the j value, the more the variances included in

, because this is the summation from 1 to j. In this definition, the first principal component with the biggest deviant receives the smallest weight and the last principal component with the smallest deviant receives the biggest weight (1). Due to the characteristics of PCA, outliers have a relatively large variance on the last principal component, leading to a larger anomaly score.

5. Distribution

5.1 Relative Entropy (KL Divergence)

, the relative entropy is

5.2 Chi-square Test

and obtain the probability of the observed result. In the test statistics, O represents the observed value and E the expected value. The test statistics provides a method to measure the difference between the expected result and the observed result. The final determination is based on the probability table according to the significance level that has been set.

6. Tree (Isolation Forest)

Figure 7. iForest detection result

Isolation Forest: Suppose that we use a random hyperplane to split the data space. Each split can generate two sub-spaces. Then we keep splitting individual sub-spaces by using a random hyperplane until each sub-space contains only one data point. Clusters with a high density need to be split many times before each sub-space contains only one data point. However, a sub-space where the data point density is low can quickly be further split into smaller sub-spaces with each containing only one data point. As shown in Figure 7, the black points are outliers, which are included in a sub-space after the space is split several times; the white points are normal points, which are converged in a cluster. The outlier boundary detected by the isolation forest is the red line in Figure 7. The isolation forest can correctly detect all the black outliers.

As shown in Figure 8, iForest is used to divide four data entries. The height of b and c is 3, the height of a is 2, and the height of d is 1. Because d is isolated first, it is most likely to be an outlier.

Figure 8. iForest isolation

7. Graph

7.1 Maximal Connected Subgraph

Figure 9 shows the connections between device IDs and mbr IDs. An edge between two nodes indicates that the corresponding member has logged on to a specific device. From Figure 9, we know that the same member has logged on to device_1, device_2, device_3, and device_4. This can be used to determine fraud in certain scenarios, usually for spotting coordinated fraud groups.

Figure 9. Maximal connected subgraph result

The premise of the maximal connected subgraph is that each edge must have confidence. Applicable scenario: Find all connections. When an edge with low confidence exists, dirty data needs to be removed. Otherwise the validity of the maximal connected subgraph would be reduced.

7.2 Label Propagation for Clustering

Figure 10. Graph structure of the label propagation algorithm for clustering

A small number of links are allowed between the subgraphs of the label propagation for clustering. Applicable scenario: High cohesion and low coupling between nodes. In Figure 10, a subgraph is obtained from the maximal connected subgraph, and two subgraphs are obtained from the label propagation algorithm for clustering.

8. Behavior Sequence (Markov Chain)

Figure 11. User behavior state chart

To obtain a state transition matrix, we can collect any two adjacent states in a normal behavior sequence and calculate the probability of each state transiting to any other state. For each user behavior sequence to be detected, we can easily find the probability of that sequence. The larger the probability, the more normal the user behaviors are.

9. Supervised Models

9.1 Machine Learning Model GBDT

9.2 Deep Learning Model Wide & Deep

Deep uses neural network to freely combine and map features and features strong generalization. Deep extracts more common characteristics of the sample features for judgement, but this may lead to excessive generalization.

The algorithm combines the two types of features to achieve a balance between memorization and generalization.

To further strengthen the generalization ability of the model, we can use the samples mined by using the unsupervised methods in the preceding sections as our training samples for training the Wide & Deep model to identify fraud.

Figure 12. Wide & Deep model

10. Other Problems

10.1 Common Threshold Selection Methods

  1. Unsupervised methods: Use the quantile threshold and locate the inflection point of the distribution curve of historical data.
  2. Supervised models: Check the accuracy curve and the recall curve of the validation sets.

10.2 Transform Non-Gaussian Distributions to Gaussian Distributions


Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store