My Toolkit for Anomaly Detection

Life is full of surprises. Our goal is to distinguish them from “normal” behavior; this is called anomaly detection. In fact, anomalies are often the most interesting things in data analysis, and it is always good to have a set of tools for finding them at hand. Here is my toolkit.

AnomalyDetection R package

Twitter’s AnomalyDetection is a popular, easy-to-use R package for time series anomaly analysis. The package uses the Seasonal Hybrid ESD (S-H-ESD) algorithm, which builds on the generalized Extreme Studentized Deviate test, to identify local and global anomalies.

As a result, we get a data.frame of anomalous observations and, if requested, a plot showing both the time series and the detected anomalies, marked by circles:

[Figure: sunspot numbers]
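To make the idea behind Seasonal Hybrid ESD more concrete, here is a minimal sketch in base R. It is not the package's implementation: the function name shesd_sketch, its parameters, and the synthetic series are all assumptions made for illustration. It removes a seasonal component estimated with stl, then applies a generalized ESD test to the residuals using the median and MAD as robust statistics.

```r
# Simplified S-H-ESD-style detector using only base R (stats package).
# Illustrative only; this is NOT the AnomalyDetection package's code.
shesd_sketch <- function(x, period, max_anoms = 0.05, alpha = 0.05) {
  n <- length(x)

  # 1. Estimate the seasonal component with STL; remove it together with the median
  fit <- stl(ts(x, frequency = period), s.window = "periodic")
  seasonal <- as.numeric(fit$time.series[, "seasonal"])
  resid <- as.numeric(x) - seasonal - median(x)

  # 2. Generalized ESD on the residuals, with median/MAD instead of mean/sd
  k <- max(1, floor(max_anoms * n))   # upper bound on the number of anomalies
  idx <- seq_len(n)                   # indices still under consideration
  removed <- integer(0)               # candidates in order of removal
  n_anoms <- 0
  for (i in seq_len(k)) {
    m <- length(idx)
    s <- mad(resid[idx])
    if (s == 0) break
    ares <- abs(resid[idx] - median(resid[idx])) / s
    j <- which.max(ares)

    # Critical value of the two-sided generalized ESD test
    t_crit <- qt(1 - alpha / (2 * m), df = m - 2)
    lambda <- (m - 1) * t_crit / sqrt((m - 2 + t_crit^2) * m)

    if (ares[j] > lambda) n_anoms <- i
    removed <- c(removed, idx[j])
    idx <- idx[-j]
  }
  sort(removed[seq_len(n_anoms)])
}

# Synthetic example: a daily cycle of 24 points with one injected spike
set.seed(1)
x <- 10 * sin(2 * pi * (1:240) / 24) + rnorm(240)
x[100] <- x[100] + 30
anoms <- shesd_sketch(x, period = 24)
print(anoms)
```

The returned vector holds the positions of the points flagged as anomalous. For real work the package itself is the better choice, since it also handles timestamps, long series, and plotting.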

Outlier in psych R package

When dealing with multidimensional numeric or logical data, we can detect outliers by calculating the Mahalanobis distance for each data point and comparing these distances to the expected values of χ². We can do this with the outlier function of the psych R package:

library(psych)
D2 <- outlier(dat, plot = TRUE, bad = 5)

Looking at the Q-Q plot below, we can set a threshold for D2 to identify outliers, let’s say, above 18:

outlier-qq-plot

In other words, any observation whose Mahalanobis distance is above the threshold can be considered an outlier.
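The same thresholding can be sketched with base R alone, using mahalanobis and qchisq from the stats package. The data here is synthetic, and the cutoff is taken from a χ² quantile (0.999 is an arbitrary choice of mine) rather than read off a Q-Q plot, so treat it as an illustration and not as what outlier does internally.

```r
# Mahalanobis-distance outlier check with base R (stats package).
set.seed(42)

# Hypothetical data: 200 observations of 3 roughly independent variables
dat <- matrix(rnorm(200 * 3), ncol = 3)
dat[1, ] <- c(6, -6, 6)   # plant one obvious outlier in row 1

# Squared Mahalanobis distance of every row from the sample centroid
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# Under multivariate normality, d2 is approximately chi-squared with
# df = number of variables; the 0.999 quantile is an assumed cutoff
threshold <- qchisq(0.999, df = ncol(dat))

outliers <- which(d2 > threshold)
print(outliers)
```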

Time Series Anomaly Detection in Azure ML

I like Microsoft Azure Machine Learning Studio. It contains a really powerful module for Time Series Anomaly Detection. It can measure:

  • the magnitude of upward and downward changes
  • direction and duration of trends: positive vs. negative changes

The module learns the pattern from the data, and adds two columns (Anomaly score and Alert indicator) to indicate values that are potentially anomalous:

azure-ml-ts-anomaly

One-Class Support Vector Machine in Azure ML

This Azure ML module can be used when we have a lot of data labeled as “normal” and only a few anomalous instances. One-class SVM learns a discriminative boundary around the normal instances, and everything outside the boundary is considered anomalous. Our responsibility is to tune the model parameters and train it.

Running the experiment scores the data. The scored output adds two more columns to the dataset: Scored Labels and Scored Probabilities. The Scored Label is 1 or 0, where 1 represents an outlier:

azure-ml-svm-anomaly

PCA-Based Anomaly Detection in Azure ML

As with One-Class SVM, the PCA-Based Anomaly Detection model is trained on normal data. The scored dataset contains Scored Labels and Scored Probabilities. Note, however, that for the PCA-based model a Scored Label of 1 means normal data:

azure-ml-pca-anomaly
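The general idea behind PCA-based anomaly detection can be sketched in base R with prcomp: fit the principal components on normal data, then flag points that the leading components reconstruct poorly. This is only an illustration of the technique, not the Azure ML module's implementation. The synthetic data, the helper recon_error, the number of components k, and the 99% cutoff are all assumptions.

```r
# PCA reconstruction-error sketch with base R (stats package).
set.seed(7)

# Hypothetical "normal" training data: three features driven by one latent factor
n <- 300
z <- rnorm(n)
train <- cbind(a = z + rnorm(n, sd = 0.1),
               b = 2 * z + rnorm(n, sd = 0.1),
               c = -z + rnorm(n, sd = 0.1))

pca <- prcomp(train, center = TRUE, scale. = TRUE)
k <- 1   # number of leading components kept (assumed)

# Squared distance between each point and its projection on the first k components
recon_error <- function(newdata, pca, k) {
  scaled <- scale(newdata, center = pca$center, scale = pca$scale)
  V <- pca$rotation[, seq_len(k), drop = FALSE]
  projected <- scaled %*% V %*% t(V)
  rowSums((scaled - projected)^2)
}

# Cutoff: 99th percentile of the training errors (an arbitrary choice)
threshold <- unname(quantile(recon_error(train, pca, k), 0.99))

# One point that follows the training pattern and one that breaks it
test <- rbind(normal = c(1, 2, -1), odd = c(1, -2, 1))
err <- recon_error(test, pca, k)
is_anomaly <- err > threshold
print(is_anomaly)
```

A point that respects the correlation structure of the training data has a small reconstruction error, while a point that violates it does not, even when each of its coordinates looks unremarkable on its own.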

rxOneClassSvm in R

If for some reason we cannot use cloud-based solutions (and therefore Azure ML), we can use the rxOneClassSvm function from the MicrosoftML R package. MicrosoftML is a package for Microsoft R Server, Microsoft R Client, and SQL Server Machine Learning Services.

The training set contains only examples from the normal class. In order to train a model we have to specify an R formula:

library(MicrosoftML)
svmModel <- rxOneClassSvm(
   formula = ~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
   data = trainIris)

Scoring results include a variable Score:

scoreDF <- rxPredict(svmModel, 
   data = testIris, extraVarsToWrite = "isIris")
tail(scoreDF)
   isIris      Score
57      1 -0.3131609
58      1 -0.3095322
59      1 -0.1532502
60      1 -0.3937540
61      0  0.5537572
62      0  0.4861979
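In the rows printed above, the normal observations (isIris = 1) have negative scores and the anomalous ones (isIris = 0) have positive scores. Assuming that pattern holds, the scores can be turned into labels with a simple cutoff. The snippet below rebuilds just those six rows by hand so that it runs without MicrosoftML; the cutoff of 0 and the column name predictedAnomaly are my own assumptions.

```r
# Turn one-class SVM scores into anomaly labels with a cutoff.
# The six rows are copied from the tail(scoreDF) output shown above.
scoreDF <- data.frame(
  isIris = c(1, 1, 1, 1, 0, 0),
  Score  = c(-0.3131609, -0.3095322, -0.1532502,
             -0.3937540,  0.5537572,  0.4861979)
)

cutoff <- 0   # assumed cutoff; tune it on validation data in practice
scoreDF$predictedAnomaly <- scoreDF$Score > cutoff

# Cross-tabulate the true class against the prediction
confusion <- table(actualAnomaly = scoreDF$isIris == 0,
                   predictedAnomaly = scoreDF$predictedAnomaly)
print(confusion)
```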

The R documentation states:

“This algorithm will not attempt to load the entire dataset into memory.”

Hmm, quite a useful feature indeed!

What else?

In fact, there are many more packages and techniques for anomaly detection. We can use binary or multi-class classifiers, cluster analysis, neural networks, kNN, and many others. But this is my first aid kit.