My Toolkit for Anomaly Detection

Life is full of surprises. Our goal is to distinguish them from “normal” behavior, and that is called anomaly detection. In fact, anomalies are often the most interesting things in data analysis, so it is always good to have a set of handy tools for finding them. Here is my toolkit.

AnomalyDetection R package

Twitter’s AnomalyDetection is a popular and easy-to-use R package for time series anomaly analysis. The package uses the Seasonal Hybrid ESD (Extreme Studentized Deviate) algorithm to identify local and global anomalies.

As a result, we get a data.frame of anomalous observations and, if necessary, a plot showing both the time series and the estimated anomalies, marked by circles:

sunspont-numbers
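To give a feel for the idea behind ESD-style tests, here is a minimal sketch in Python (not R, and not the package's actual algorithm). It repeatedly removes the most extreme point, measuring "extreme" with the median and the median absolute deviation (MAD). Two simplifications are my own assumptions: there is no seasonal decomposition, and a fixed cutoff of 3.5 stands in for the critical values that a real ESD test derives from the t-distribution. The function name and the sample series are made up for illustration.

```python
from statistics import median


def robust_extreme_deviates(values, max_anomalies, cutoff=3.5):
    """Return original indices of points flagged as anomalous.

    Simplified illustration: a fixed cutoff replaces the ESD critical values.
    """
    remaining = list(enumerate(values))  # (original index, value) pairs
    flagged = []
    for _ in range(max_anomalies):
        if len(remaining) < 3:
            break
        vals = [v for _, v in remaining]
        med = median(vals)
        mad = median(abs(v - med) for v in vals)
        if mad == 0:
            break  # no spread left to measure deviations against
        # 1.4826 rescales the MAD so it is comparable to a standard deviation
        scale = 1.4826 * mad
        # Position (within `remaining`) of the point farthest from the median
        pos = max(range(len(remaining)), key=lambda i: abs(remaining[i][1] - med))
        idx, val = remaining[pos]
        if abs(val - med) / scale <= cutoff:
            break  # the most extreme point is not extreme enough; stop
        flagged.append(idx)
        del remaining[pos]
    return flagged


# Hypothetical series with one spike and one dip
series = [10, 11, 10, 12, 11, 40, 10, 11, 12, 10, -15, 11]
print(robust_extreme_deviates(series, max_anomalies=3))
```

Using the median and MAD instead of the mean and standard deviation keeps a single large spike from inflating the scale and hiding the other anomalies.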

Outlier in psych R package

When dealing with multidimensional numeric or logical data, we can detect outliers by calculating the squared Mahalanobis distance (D2) for each data point and comparing these distances to the expected values of χ². We can do it with the outlier function of the psych R package:

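library(psych)  # provides outlier()
D2 <- outlier(dat, plot=TRUE, bad=5)  # squared Mahalanobis distances; labels the 5 most extreme points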

Looking at the Q-Q plot below, we can set a threshold for D2 to identify outliers, let’s say, above 18:

outlier-qq-plot

In other words, any observation whose Mahalanobis distance is above the threshold can be considered an outlier.
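For readers who want to see what the distance actually is, here is a minimal Python sketch (not the psych implementation) that computes the squared Mahalanobis distance for two-dimensional points using only the standard library. The function name, the sample points, and the cutoff of 6.0 are hypothetical. In a sample this small the distances cannot reach 18, so the threshold from the Q-Q plot above does not carry over.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^T S^-1 d, with the 2x2 inverse written out explicitly
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical data: eight similar points and one that breaks the pattern
points = [(1.0, 2.1), (1.2, 2.3), (0.9, 1.9), (1.1, 2.2), (1.0, 2.0),
          (1.3, 2.6), (0.8, 1.7), (1.1, 2.1), (5.0, 0.5)]
d2 = mahalanobis_sq_2d(points)
threshold = 6.0  # hypothetical cutoff chosen for this toy sample
outliers = [i for i, d in enumerate(d2) if d > threshold]
print(outliers)
```

Unlike a plain Euclidean distance, this measure takes the correlation between the two variables into account, so a point can be flagged for breaking the usual relationship between x and y even when each coordinate looks unremarkable on its own.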

Time Series Anomaly Detection in Azure ML

I like Microsoft Azure Machine Learning Studio. It contains a really powerful module for Time Series Anomaly Detection. It can measure:

  • the magnitude of upward and downward changes
  • direction and duration of trends: positive vs. negative changes

The module learns the pattern from the data, and adds two columns (Anomaly score and Alert indicator) to indicate values that are potentially anomalous:

azure-ml-ts-anomaly

One-Class Support Vector Machine in Azure ML

This Azure ML module can be used when we have a lot of data labeled as “normal” and only a few anomalous instances. One-class SVM learns a discriminative boundary around the normal instances, and everything outside the boundary is considered anomalous. Our responsibility is to tune the model parameters and train the model.

Running the experiment scores the data. The scored output adds two more columns to the dataset: Scored Labels and Scored Probabilities. The Scored Label is either 1 or 0, where 1 represents an outlier:

azure-ml-svm-anomaly

PCA-Based Anomaly Detection in Azure ML

As with One-class SVM, the PCA-Based Anomaly Detection model is trained on normal data. The scored dataset contains Scored Labels and Scored Probabilities. Keep in mind, though, that for the PCA-based model a Scored Label of 1 means normal data:

azure-ml-pca-anomaly

rxOneClassSvm in R

If for some reason we cannot use cloud-based solutions (and therefore Azure ML), we can use the rxOneClassSvm function from the MicrosoftML R package. MicrosoftML is a package for Microsoft R Server, Microsoft R Client, and SQL Server Machine Learning Services.

The training set contains only examples from the normal class. In order to train a model we have to specify an R formula:

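library(MicrosoftML)
# Train on the normal class only; the one-sided formula lists just the features
svmModel <- rxOneClassSvm(
   formula = ~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
   data = trainIris)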

Scoring results include a variable Score:

scoreDF <- rxPredict(svmModel, 
   data = testIris, extraVarsToWrite = "isIris")
tail(scoreDF)
   isIris      Score
57      1 -0.3131609
58      1 -0.3095322
59      1 -0.1532502
60      1 -0.3937540
61      0  0.5537572
62      0  0.4861979
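To turn these scores into a yes/no decision we need a cutoff. The following Python sketch (not part of MicrosoftML) applies one to the six rows printed above. It assumes that a higher score means "more anomalous", which is what this output suggests, since the two non-iris rows are the only ones with positive scores. The cutoff of 0 and the function name are hypothetical choices for illustration.

```python
# (row id, isIris, Score) copied from the tail(scoreDF) output above
scored_rows = [
    (57, 1, -0.3131609),
    (58, 1, -0.3095322),
    (59, 1, -0.1532502),
    (60, 1, -0.3937540),
    (61, 0, 0.5537572),
    (62, 0, 0.4861979),
]


def flag_outliers(rows, cutoff=0.0):
    """Return ids of rows whose score is above the (assumed) cutoff."""
    return [row_id for row_id, _, score in rows if score > cutoff]


flagged = flag_outliers(scored_rows)
print(flagged)

# Compare with the known labels: isIris == 0 marks the true outliers
actual = [row_id for row_id, is_iris, _ in scored_rows if is_iris == 0]
print(flagged == actual)
```

On real data the cutoff should be chosen by looking at the score distribution on a validation set rather than fixed at zero.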

R documentation asserts:

“This algorithm will not attempt to load the entire dataset into memory.”

Hmm, quite a useful feature indeed!

What else?

In fact, there are many more packages for anomaly detection. We can use binary or multi-class classifiers, cluster analysis, neural networks, kNN, and many other methods. But this is my first aid kit.

Temperature Sensor for Windows 10 IoT Core

I know it is not rocket science to connect a sensor to your computer and read its measurements. But if you haven’t done it before, I will show you a simple experiment with a Raspberry Pi and a temperature sensor.

temp-sensor

I used a Raspberry Pi 3 Model B, though a Pi 2 would work as well. If you already have Windows 10 IoT Core installed on your Raspberry Pi, that’s fine. If not, there is a good guide on how to install it: How to install Windows 10 IoT Core on Raspberry Pi 3. The only thing I would add is to format your MicroSD card with the SD Memory Card Formatter before installing any OS on it.

temp-sensor1

Next, we will need a sensor. I bought a cheap (and not too accurate) DHT11 humidity and temperature sensor. It has three pins:

  1. Power supply 3 – 5.5 V DC (+)
  2. Serial data output
  3. Ground (-)

Having everything in hand, connect the sensor to the Raspberry Pi as shown below:

temp-sensor-connection

When everything is set up, connected and switched on, open Windows 10 IoT Core Dashboard and connect to your Raspberry Pi (see instructions if something went wrong).

Find your device in the list, right-click on it and select Open in Device Portal. This will launch the Windows Device Portal, which lets you manage your device remotely over a network. To enable remote display functionality on your Raspberry Pi, choose Remote from the options on the left and check Enable Windows IoT Remote Server.

win-core-screen

Install the Windows IoT Remote Client on your workstation and run it. Connect to your Raspberry Pi device through the client. You should see a window like the one pictured.

Now you can download the source code of the TemperatureSensor project from GitHub: https://github.com/jevgenij-p/blog/tree/master/Temperature%20Sensor.

Open the project in Visual Studio 2017 and build it. You will find a separate library, Sensors.Dht, written by Daniel Porrey. I used his code to read the sensor’s data.
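If you are curious what a reading looks like once the bits have been collected, here is a small Python sketch (not the C# Sensors.Dht code). According to the DHT11 datasheet, the sensor sends 40 bits: one byte each for the integer and decimal parts of the humidity, one byte each for the integer and decimal parts of the temperature, and a checksum byte equal to the low 8 bits of the sum of the other four. The function name and the sample frame are hypothetical, and treating the decimal bytes as tenths is an assumption (many DHT11 units simply send zero there).

```python
def decode_dht11_frame(frame):
    """Decode a 5-byte DHT11 frame into (humidity %, temperature in Celsius)."""
    if len(frame) != 5:
        raise ValueError("a DHT11 frame is exactly 5 bytes (40 bits)")
    hum_int, hum_dec, temp_int, temp_dec, checksum = frame
    # The checksum is the low 8 bits of the sum of the four data bytes
    if (hum_int + hum_dec + temp_int + temp_dec) & 0xFF != checksum:
        raise ValueError("checksum mismatch: the reading is corrupted")
    # Assumption: the decimal bytes hold tenths
    humidity = hum_int + hum_dec / 10
    temperature = temp_int + temp_dec / 10
    return humidity, temperature


# Hypothetical frame: 45 % humidity, 23 degrees Celsius, checksum 45 + 23 = 68
sample = bytes([45, 0, 23, 0, 68])
print(decode_dht11_frame(sample))
```

The hard part, which the library handles, is the timing: each bit is encoded as the length of a pulse on the data pin, so the reading code has to measure those pulses before a frame like this can be assembled.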

Set the solution platform to ARM and the target to Remote Machine in the toolbar:

vs-menu

Open the TemperatureSensor project properties and choose the Debug page. Check that the Remote machine field contains your Raspberry Pi device name or IP address:

vs-debug-properties

Press F5 to build and deploy the application to your Raspberry Pi. The sensor should start sending temperature and humidity values every second, and the Windows IoT Remote Client will show a screen like this one:

sensor-screen

Voila!

The source code of the Visual Studio project can be downloaded from the GitHub repository mentioned above.

R pairs chart in Power BI

As a rule, we use Power BI to present our findings by creating dashboards or reports. But Microsoft Power BI can be useful at the stage of initial exploratory data analysis as well.

I discovered this when I needed to examine a really wide data table containing hundreds of columns. Usually, I write an R script that creates scatterplot matrices using pairs(). But with a lot of features, and a wish to browse them in different combinations, that becomes a bit onerous.

That is why I created a ggpairs R Visual that shows the same chart in Power BI. There are two reasons for that. First, I can quickly select the features to display simply by ticking them in the “Fields” pane in Power BI. Second, Power BI has a lot of data sources, which can be accessed much more easily than in R.

r-pairs

Of course, Power BI has a few drawbacks. It tries to refresh the chart every time you select or deselect a field, which is annoying. And do not forget about the data size limitation in R Visuals: Power BI takes no more than the first 150,000 rows.

The source code of ggpairs.R is available for download.

Got the Microsoft Professional Program Certificate in Data Science!

DSCertificate

A year ago I decided to take a course dedicated to Machine Learning and Data Science. Microsoft offered the “Microsoft Professional Program for Data Science”, based on massive open online courses (MOOCs) on the edX platform.

The program consists of 9 courses grouped into 4 units, plus a final project (see more at https://academy.microsoft.com/en-us/professional-program/data-science/). Some of the units allow you to choose from different courses. For example, you can choose between courses that require knowledge of R or Python.

I completed the following courses:

Course                                                                   Length
Microsoft – DAT101x: Data Science Orientation                            6 weeks
Microsoft – DAT201x: Querying with Transact-SQL                          6 weeks
Microsoft – DAT207x: Analyzing and Visualizing Data with Power BI        6 weeks
ColumbiaX – DS101X: Statistical Thinking for Data Science and Analytics  5 weeks
Microsoft – DAT204x: Introduction to R for Data Science                  4 weeks
Microsoft – DAT203.1x: Data Science Essentials                           6 weeks
Microsoft – DAT203.2x: Principles of Machine Learning                    6 weeks
Microsoft – DAT209x: Programming with R for Data Science                 6 weeks
Microsoft – DAT203.3x: Applied Machine Learning                          6 weeks
Microsoft Professional Capstone: Data Science                            4 weeks

Each course, including the Capstone Project, costs $99 for a verified certificate. In total, you will pay $990 (10 × $99), unless Microsoft raises the price again, as it has already done twice.

The courses are well structured: some theory presented by a trainer, hands-on demos, quizzes, labs, and exams.

The most enjoyable part for me was the Capstone Project. It is a competition in which you have to predict some values from a given dataset and write a report on your analysis and findings. You can use any techniques you want, but the final score depends on the accuracy of your predictions.

submissions

It was a really excellent experience, but I have to take a breath before I start looking for new courses.