Who is a Data Science Engineer?

Who is a Data Science Engineer, and is there any difference between them and a Data Scientist? Your answer to this question is an indicator of the success or failure of your Data Science project.

Software development is a risky business, but the risks skyrocket when you give in to your client’s persuasion “to add some AI to the system”. Your next move is to hire Data Scientists and delegate this task to them.

“What are you doing?” – the Scrum Masters wonder. “Building models” – they say. After a few months, you start to realize that you cannot control this process. Why?

Because the Data Science Process IS NOT the Software Development Process! Any attempt to ignore this fact quickly brings the project to an epic fail.

The Data Science Process has its own set of inputs, outputs, roles, deliverables and process flow. Look at some of them: TDSP, CRISP-DM, KDD, SEMMA. I hope your Project Managers are aware of them. But even if they are – an epic fail is still your main option.

Because there is a gap between these two processes: Data Science and Software Development.

Even if your developers know everything about programming, they hardly know anything about Statistics, Machine Learning, or Data Science (beyond popular articles). Likewise, Data Scientists know little about building Line-of-Business Applications and, frankly speaking, are not too strong in programming. How are you going to cope with that?

[Figure: Project Manager staring at a Data Science project.]

A Data Science Engineer is the answer. You need a person:

  • with strong programming skills
  • with basic statistical skills
  • with an understanding of the Data Science Process and the ability to participate in each stage of it
  • who knows specialized DS languages and tools like R, TensorFlow and so on.

Data Scientists and developers live in different worlds. A Data Science Engineer lives in both. They are the magic adhesive tape without which your Data Science project will fall apart.

My Toolkit for Anomaly Detection

Life is full of surprises. Our goal is to distinguish them from “normal” behavior. That is called Anomaly Detection. In fact, anomalies are the most interesting things in Data Analysis, and it is always good to have a set of handy tools for the job. Here is my toolkit.

AnomalyDetection R package

Twitter’s AnomalyDetection is a popular and simple-to-use R package for time series anomaly analysis. The package uses the Seasonal Hybrid ESD (Extreme Studentized Deviate test) algorithm to identify local and global anomalies.

As an outcome, we get a data.frame with the anomalous observations and, if necessary, a plot of the time series with the estimated anoms indicated by circles:

[Figure: sunspot numbers time series with the estimated anomalies circled]
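
For reference, a minimal call might look like the sketch below. It uses the raw_data sample shipped with the package; the parameter values are just illustrative, not recommendations:

# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

data(raw_data)  # sample time series bundled with the package

# Detect both positive and negative anomalies, allowing at most 2% of points
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
   direction = 'both', plot = TRUE)

res$anoms  # data.frame with the anomalous observations
res$plot   # time series plot with the estimated anoms circled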

The outlier function in the psych R package

When dealing with multidimensional numeric or logical data, we can detect outliers by calculating the Mahalanobis distance for each data point and then comparing these distances to the expected values of χ². We can do it with the outlier function from the psych R package:

# Mahalanobis D2 for each row; bad=5 labels the five most extreme points on the plot
D2 <- outlier(dat, plot=TRUE, bad=5)

Looking at the Q-Q plot below, we can set a threshold for D2 to identify outliers, let’s say, above 18:

[Figure: Q-Q plot of Mahalanobis D² against the χ² distribution]

In other words, any observation whose Mahalanobis distance is above the threshold can be considered an outlier.
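
In code, that rule of thumb is a one-liner. The threshold of 18 comes from eyeballing the plot above, and dat is the same data frame we passed to outlier:

threshold <- 18
outliers <- dat[D2 > threshold, ]  # rows whose Mahalanobis D2 exceeds the threshold
nrow(outliers)                     # how many observations we flag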

Time Series Anomaly Detection in Azure ML

I like Microsoft Azure Machine Learning Studio. It contains a really powerful module for Time Series Anomaly Detection. It can measure:

  • the magnitude of upward and downward changes
  • direction and duration of trends: positive vs. negative changes

The module learns the pattern from the data and adds two columns (Anomaly score and Alert indicator) to indicate values that are potentially anomalous:

[Figure: Azure ML Time Series Anomaly Detection output with the Anomaly score and Alert indicator columns]

One-Class Support Vector Machine in Azure ML

This Azure ML module can be used when we have a lot of data labeled as “normal” and not too many anomalous instances. A one-class SVM learns a discriminative boundary around the normal instances, and everything outside the boundary is considered anomalous. Our responsibility is to tune the model parameters and train it.

Running the experiment scores the data. The scored output adds two more columns to the dataset: Scored Labels and Score Probabilities. The Scored Label is a 1 or a 0, where 1 represents an outlier:

[Figure: Azure ML One-Class SVM scored dataset with the Scored Labels and Score Probabilities columns]

PCA-Based Anomaly Detection in Azure ML

As in the case of One-Class SVM, the PCA-Based Anomaly Detection model is trained on normal data. The scored dataset contains Scored Labels and Score Probabilities. But mind you, for the PCA-based model a Scored Label of 1 means normal data:

[Figure: Azure ML PCA-Based Anomaly Detection scored dataset]
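
The module itself is drag-and-drop, but the idea behind PCA-based detection (reconstruct each point from a few principal components and score it by the reconstruction error) can be sketched in plain R. This is a conceptual illustration, not the Azure ML implementation; normal_data, new_data and the number of components k are assumptions:

# Fit PCA on "normal" data only, keeping k components
k <- 2
pca <- prcomp(normal_data, center = TRUE, scale. = TRUE)

# Project new observations onto the first k components and reconstruct them
scaled <- scale(new_data, center = pca$center, scale = pca$scale)
proj   <- scaled %*% pca$rotation[, 1:k]
recon  <- proj %*% t(pca$rotation[, 1:k])

# Anomaly score = reconstruction error; large errors suggest anomalies
score <- rowSums((scaled - recon)^2)
suspects <- which(score > quantile(score, 0.95))  # e.g. flag the top 5%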

rxOneClassSvm in R

If for some reason we cannot use cloud-based solutions (and, accordingly, Azure ML), we can use the rxOneClassSvm function included in the MicrosoftML R package. MicrosoftML is a package for Microsoft R Server, Microsoft R Client, and SQL Server Machine Learning Services.

The training set contains only examples from the normal class. To train a model, we have to specify an R formula:

# Train on examples of the normal class only; the one-sided formula lists the predictors
svmModel <- rxOneClassSvm(
   formula = ~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
   data = trainIris)

Scoring results include a variable Score:

# Score the test data, keeping the true label (isIris) for comparison
scoreDF <- rxPredict(svmModel,
   data = testIris, extraVarsToWrite = "isIris")
tail(scoreDF)
   isIris      Score
57      1 -0.3131609
58      1 -0.3095322
59      1 -0.1532502
60      1 -0.3937540
61      0  0.5537572
62      0  0.4861979
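
Judging by this sample output, normal observations get negative scores and the anomalous ones positive, so a simple post-processing step could flag positive scores. The zero cut-off is an assumption based on the output above, not a documented default:

scoreDF$suspected <- scoreDF$Score > 0  # TRUE for likely anomalies
subset(scoreDF, suspected)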

The R documentation asserts:

“This algorithm will not attempt to load the entire dataset into memory.”

Hmm, quite a useful feature indeed!

What else?

In fact, there are many more packages for anomaly detection. We can use any binary or multi-class classifier, cluster analysis, neural networks, kNN and many other techniques. But this is my First Aid Kit.

Got the Microsoft Professional Program Certificate in Data Science!

[Figure: Microsoft Professional Program Certificate in Data Science]

A year ago I decided to take a course dedicated to Machine Learning and Data Science. Microsoft offered the “Microsoft Professional Program for Data Science” as a set of massive open online courses (MOOCs) on the edX platform.

The program consists of 4 units comprising 9 courses and a final project (see more at https://academy.microsoft.com/en-us/professional-program/data-science/). Some of the units allow you to choose between different courses. For example, you can choose between courses requiring knowledge of R or Python.

I completed the following courses:

  • Microsoft – DAT101x: Data Science Orientation (6 weeks)
  • Microsoft – DAT201x: Querying with Transact-SQL (6 weeks)
  • Microsoft – DAT207x: Analyzing and Visualizing Data with Power BI (6 weeks)
  • ColumbiaX – DS101X: Statistical Thinking for Data Science and Analytics (5 weeks)
  • Microsoft – DAT204x: Introduction to R for Data Science (4 weeks)
  • Microsoft – DAT203.1x: Data Science Essentials (6 weeks)
  • Microsoft – DAT203.2x: Principles of Machine Learning (6 weeks)
  • Microsoft – DAT209x: Programming with R for Data Science (6 weeks)
  • Microsoft – DAT203.3x: Applied Machine Learning (6 weeks)
  • Microsoft Professional Capstone: Data Science (4 weeks)

Each course, including the Capstone Project, costs $99 for a verified certificate. That means you will pay $990 in total, unless Microsoft raises the price again, as it has already done twice before.

The courses are well structured: some theory presented by a trainer, hands-on demos, quizzes, labs, and exams.

The most enjoyable part for me was the Capstone Project. It is a competition in which you have to predict some values from a given dataset and write a report on your analysis and findings. You can use any techniques you want, but the final score depends on the accuracy of your predictions.

[Figure: Capstone Project submissions]

It was a really excellent experience, but I need to catch my breath before I start looking for new courses.