Life is full of surprises. Our goal is to distinguish them from “normal” behavior; that is called anomaly detection. In fact, anomalies are often the most interesting things in data analysis, and it is always good to have a set of handy tools for finding them. Here is my toolkit.
AnomalyDetection R package
Twitter’s AnomalyDetection is a popular, easy-to-use R package for time series anomaly analysis. The package uses the Seasonal Hybrid ESD (Extreme Studentized Deviate test) algorithm to identify local and global anomalies.
As a result, we get a data.frame with the anomalous observations and, if requested, a plot showing both the time series and the estimated anomalies, indicated by circles.
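To make the idea behind Seasonal Hybrid ESD more concrete, here is a simplified base-R sketch. It is not the package's implementation: the function name `shesd_sketch`, its arguments, and the synthetic series are assumptions made for illustration. It removes an STL seasonal component and the median, then runs a generalized ESD test that uses the median and MAD in place of the mean and standard deviation.

```r
# Simplified Seasonal Hybrid ESD sketch in base R. This is NOT the
# AnomalyDetection package's code; shesd_sketch and its arguments are
# hypothetical names chosen for this illustration.
shesd_sketch <- function(x, period, max_anoms = 0.05, alpha = 0.05) {
  n <- length(x)
  # Estimate the seasonal component with STL, then subtract it and the median
  fit <- stl(ts(x, frequency = period), s.window = "periodic", robust = TRUE)
  resid <- as.numeric(x) - as.numeric(fit$time.series[, "seasonal"]) - median(x)

  k <- max(1, floor(max_anoms * n))  # upper bound on the number of anomalies
  idx <- seq_len(n)                  # positions still under consideration
  removed <- integer(0)
  n_anoms <- 0
  for (i in seq_len(k)) {
    r <- resid[idx]
    s <- mad(r)
    if (s == 0) break
    dev <- abs(r - median(r)) / s    # robust analogue of the studentized deviate
    j <- which.max(dev)
    # Critical value of the generalized ESD test for the current sample size
    m <- length(r)
    tcrit <- qt(1 - alpha / (2 * m), df = m - 2)
    lambda <- (m - 1) * tcrit / sqrt((m - 2 + tcrit^2) * m)
    if (dev[j] > lambda) n_anoms <- i
    removed <- c(removed, idx[j])
    idx <- idx[-j]
  }
  sort(removed[seq_len(n_anoms)])    # indices of the flagged observations
}

# Synthetic series with a 24-step cycle and two injected spikes
set.seed(1)
tt <- 1:240
x <- 10 * sin(2 * pi * tt / 24) + rnorm(240, sd = 0.5)
x[c(50, 130)] <- x[c(50, 130)] + 25
anoms <- shesd_sketch(x, period = 24)
```

The two injected spikes (positions 50 and 130) should be among the returned indices; a few neighbours in the same phase of the cycle may be flagged as well, which is one reason to prefer the package for real work.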
Outlier in psych R package
When dealing with multidimensional numeric or logical data, we can detect outliers by calculating the Mahalanobis distance for each data point and then comparing these distances to the expected values of χ². We can do it with the outlier function of the psych R package:
library(psych)
D2 <- outlier(dat, plot = TRUE, bad = 5)
Looking at the resulting Q-Q plot, we can set a threshold for D2 to identify outliers, let’s say, above 18.
In other words, any observation whose Mahalanobis distance is above the threshold can be considered an outlier.
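The same idea can be sketched in base R, with the threshold taken from a χ² quantile instead of being read off the plot. This is only an illustration: the iris measurements stand in for `dat`, and the 0.999 quantile is an assumed choice, not a value from the analysis above.

```r
# Base-R sketch: squared Mahalanobis distances with a chi-squared cutoff.
# Assumptions: `dat` is an all-numeric data frame (iris columns are used
# as stand-in data) and the 0.999 quantile is an illustrative choice.
dat <- iris[, 1:4]

# Squared distance of every row from the column means
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# For multivariate normal data, d2 is approximately chi-squared
# with ncol(dat) degrees of freedom
cutoff <- qchisq(0.999, df = ncol(dat))  # about 18.47 for 4 variables

outliers <- which(d2 > cutoff)           # row indices above the threshold
```

A stricter or looser quantile simply moves the cutoff; plotting sorted `d2` against `qchisq(ppoints(length(d2)), df = ncol(dat))` gives a Q-Q plot for choosing it by eye.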
Time Series Anomaly Detection in Azure ML
I like Microsoft Azure Machine Learning Studio. It contains a really powerful module for Time Series Anomaly Detection. It can measure:
- the magnitude of upward and downward changes
- direction and duration of trends: positive vs. negative changes
The module learns the pattern from the data and adds two columns (Anomaly score and Alert indicator) that mark potentially anomalous values.
One-Class Support Vector Machine in Azure ML
This Azure ML module can be used when we have a lot of data labeled as “normal” and only a few anomalous instances. One-class SVM learns a discriminative boundary around the normal instances, and everything outside the boundary is considered anomalous. Our responsibility is to tune the model parameters and train it.
Running the experiment scores the data. The scored output adds two more columns to the dataset: Scored Labels and Score Probabilities. The Scored Label is either 1 or 0, where 1 represents an outlier.
PCA-Based Anomaly Detection in Azure ML
As with One-class SVM, the PCA-Based Anomaly Detection model is trained on normal data. The scored dataset contains Scored Labels and Score Probabilities. But mind you that for the PCA-based model, the Scored Label 1 means normal data.
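Outside Azure ML, the general idea of PCA-based detection can be sketched in base R as a reconstruction error: fit PCA on normal data, project new points onto the leading components, and score them by how badly they are reconstructed. This is a generic illustration, not the Azure module's algorithm; using setosa as the "normal" class, keeping k = 2 components, and the 95% threshold are all assumptions.

```r
# Sketch of PCA-based anomaly scoring via reconstruction error (base R).
# Assumptions: setosa rows play the role of "normal" training data;
# k = 2 components and the 95% quantile threshold are illustrative.
normal <- iris[iris$Species == "setosa", 1:4]
test   <- iris[iris$Species != "setosa", 1:4]

pca <- prcomp(normal, center = TRUE, scale. = TRUE)
k <- 2

recon_error <- function(newdata) {
  # Standardize with the training means and standard deviations
  z <- scale(newdata, center = pca$center, scale = pca$scale)
  proj <- z %*% pca$rotation[, 1:k]        # project onto the first k components
  back <- proj %*% t(pca$rotation[, 1:k])  # map back to the standardized space
  rowSums((z - back)^2)                    # squared reconstruction error per row
}

# Threshold chosen so that roughly 5% of the training data would be flagged
threshold <- quantile(recon_error(normal), 0.95)
flagged <- recon_error(test) > threshold   # TRUE marks a suspected anomaly
```

Here TRUE means "anomalous", so the polarity of the label is worth checking whenever results are compared across tools.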
rxOneClassSvm in R
If for some reason we cannot use cloud-based solutions such as Azure ML, we can use the rxOneClassSvm function from the MicrosoftML R package. MicrosoftML is a package for Microsoft R Server, Microsoft R Client, and SQL Server Machine Learning Services.
The training set contains only examples from the normal class. In order to train a model we have to specify an R formula:
library(MicrosoftML)
# trainIris is expected to hold only rows from the normal class
svmModel <- rxOneClassSvm(
  formula = ~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = trainIris)
Scoring results include a Score variable. In the output below, the non-iris rows (isIris = 0) get positive scores and the iris rows negative ones:
scoreDF <- rxPredict(svmModel, data = testIris, extraVarsToWrite = "isIris")
tail(scoreDF)
   isIris      Score
57      1 -0.3131609
58      1 -0.3095322
59      1 -0.1532502
60      1 -0.3937540
61      0  0.5537572
62      0  0.4861979
The R documentation states:
“This algorithm will not attempt to load the entire dataset into memory.”
Hmm, quite a useful feature indeed!
In fact, there are many more packages for anomaly detection. We can use binary or multi-class classifiers, cluster analysis, neural networks, kNN and many other methods. But this is my First Aid Kit.