Who is Data Science Engineer?

Who is Data Science Engineer, and is there any difference between them and Data Scientists? Your answer to this question is an indicator of success or failure of your Data Science project.

Software development is a risky business, but the risks rocket up when you give in to the persuasion of your client “to add some AI to the system”. Your next move is to hire Data Scientists and delegate this task to them.

“What are you doing?” – Scrum masters are wondering. “Building models” – they say. After a few months, you are starting to realize you cannot control this process. Why?

Because Data Science Process IS NOT Software Development Process! Any attempts to ignore this fact quickly bring the project to epic fail.

Data Science Process has its own set of inputs, outputs, roles, deliverables and process flow. Look at some them: TDSP, CRISP-DM, KDD, SEMMA. I hope your Project Managers are aware of them. But even if they are – epic fail is still your main option.

Because there is a gap between these two processes: Data Science and Software Development.

If your developers know everything about programming, they hardly know something about Statistics, Machine Learning, Data Science (excepting popular articles). Just as Data Scientists know nothing about building Line-of-Business Applications, and truly speaking are not too strong in programming. How are you going to cope with it?

ds-engineer

Project Manager staring at a Data Science project.

Data Science Engineer is the answer. You need a person

  • with strong programming skills
  • with basic Statistical skills
  • with Data Science Process understanding and ability to participate on each stage of the process
  • knowing specialized DS languages and tools like R, Tensorflow and so on.

Data Scientists and developers live in different worlds. Data Science Engineer lives in both. It is a magic adhesive tape without which your Data Science project will fall apart.

Visualize missing values in R

Data quality is extremely important for analysis, modeling, and predictions. Microsoft data science utilities for Team Data Science Process contain a script (IDEAR), which can visualize missing values. You can specify the number of segments to split your data, in order to calculate average missing value rate for each segment, and visualize it with the levelplot.

I refactored this function a bit and replaced the levelplot with the ggplot. My plot_missing() function requires two packages: ggplot2 and reshape. The code of the MissingValues.R script can be found on the GitHub download

If you pass a data.frame with some missing values (NA) to the function, you will get a visual distribution of densities of the missing values:

plot_missing(data)

missing-plot1

The leftmost column “All” shows average missing value rates for the whole data set by variables.

You can change the number of segments and the color palette. For instance, you can use a palette from the RColorBrewer package:

if (!require(RColorBrewer))
   install.packages("RColorBrewer")

library(RColorBrewer)
plot_missing(data, 5, col = brewer.pal(n = 9, name = "Blues"))

missing-plot2

If you do not remember palette names, you can display them:

display.brewer.all()

Here is another nice palette:

plot_missing(data, col = brewer.pal(n = 9, name = "YlOrRd"))

missing-plot3

The source code: GitHub download

Configuring Machine Learning Services in SQL Server 2017

rconfig-title

In order to operationalize freshly installed SQL Server 2017 Machine Learning Services, we can spend quite a few hours (or days). I hope this article will help you to save your time.

I could refer the reader to the official documentation, but I am sure you’ve already been there.

Enable external scripts

The first step will be to enable execution of external scripts in the SQL Server, running Machine Learning Services. Connect to the SQL Server with the SQL Server Management Studio, and execute the following script:

sp_configure 'external scripts enabled', 1; 
RECONFIGURE WITH OVERRIDE;

Restart your SQL Server to make this change effective. After restarting SQL services, verify that SQL Server Launchpad service is running:

rconfig-services

Now you can check that script execution is working. Open a new Query window, and then run the following command:

EXEC sp_execute_external_script 
       @language =N'R',
       @script=N'OutputDataSet <- InputDataSet;',
       @input_data_1 =N'SELECT 1 AS hello'
WITH RESULT SETS (([hello] int not null));
GO

If the command executes without errors, you will see the result:

rconfig-hello

Enable implied authentication for the Launchpad

To enable authentication for the Launchpad, you should create a new SQL Login for SQLRUserGroup user. This process is well described in the https://docs.microsoft.com/en-us/sql/advanced-analytics/r/set-up-sql-server-r-services-in-database, but I will repeat these steps here:

  1. In SQL Server Management Studio, in Object Explorer, expand Security, right-click Logins, and select New Login.
  2. In the Login – New dialog box, select Search.
  3. Select the Object Types and Groups checkboxes.
  4. Click Advanced, verify that the location to search is the current computer, and then click Find Now.
  5. Scroll through the list of group accounts on the server until you find one beginning with SQLRUserGroup.
  6. The name of the group that’s associated with the Launchpad service for the default instance is always just SQLRUserGroup. Select this account only for the default instance.
  7. If you are using a named instance, the instance name is appended to the default name, SQLRUserGroup.
  8. Click OK and close dialog boxes.

Give users permission to run external scripts

If a user, running R scripts, does not have elevated permissions, you must grant the user permissions to run scripts in each database where external scripts will be used:

USE 
GO
GRANT EXECUTE ANY EXTERNAL SCRIPT TO [UserName]

Modify server resource pools

Under SQL server default settings, the R runtime will probably have insufficient memory. First, check the resources allocated to the default pool for the server:

SELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default'

By default, the max_cpu, max_memory and cap_cpu are set to 100 percent:

rconfig-sql1

Now check the resources allocated to the default external resource pool:

SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default'

By default, the maximum memory that R can use is 20% of the remaining memory:

rconfig-sql2

The following settings will reduce the maximum computer memory that can be used by SQL Server and increase the memory allocated to external processes:

ALTER RESOURCE POOL "default" WITH (max_memory_percent = 60); 
ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 40); 
ALTER RESOURCE GOVERNOR reconfigure; 
GO

The values are just suggested settings to start with.

Please remember that SQL Server Standard edition does not support resource governance, i.e. memory distribution between SQL server and R engine:

  • Standard Edition

    Has all the capabilities of in-database analytics included with Enterprise Edition, except for resource governance. Performance and scale is also limited: the data that can be processed must fit in server memory, and processing is limited to a single compute thread, even when using the RevoScaleR functions.

Source: https://docs.microsoft.com/en-us/sql/advanced-analytics/r/differences-in-r-features-between-editions-of-sql-server.

Upgrade R Services to the latest version

First verify server’s version of the R Services, running the SQL script:

EXEC sp_execute_external_script @language =N'R',
@script=N'myvar <- version$version.string
OutputDataSet <- as.data.frame(myvar);'

SQL Server 2017 goes with the R Services of version 3.3.3, which is quite an old one. But you can upgrade the version, if you install Microsoft Machine Learning Server 9.2.1 (or upgrade it to this version).

Read the article: Upgrade machine learning components in a SQL Server instance. The article Install Machine Learning Server for Windows has links to the Machine Learning Server installer. For instance, Machine Learning Server (Developer Edition) for Windows setup can be found here.

Start the installer ServerSetup.exe for Machine Learning Server. Be sure to run the installer on the computer that has the instance you want to upgrade. To upgrade an instance, select the checkbox next to the instance name. If you do not select an instance, a separate installation of Machine Learning Server is created, and the SQL Server libraries are unchanged:

rconfig-mlserver

Verify installed R version:

  1. Find the R.exe or Rgui.exe in the folder \R_SERVICES\bin\x64
    For example:
    C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\bin\x64\Rgui.exe
  2. Double-click Rgui.exe to start the R Console application.

The console will show the current R version:

rconfig-console

As an option, you can run aforementioned SQL script:

rconfig-version

Update local Microsoft R Client to the latest version

If you need to upgrade your local Microsoft R Client to the latest version to synchronize it with the server’s version, you can:

Useful Links

Jupyter Notebook R Kernel for Microsoft R Open

Have you ever tried to add an R Kernel to Jupyter Notebook, having installed R Tools for Visual Studio? If tried, you know that this is not a piece of cake. I’ll show you how to do this.

The Jupyter Notebook is a web application in which you can create and share documents with code, text, and visualizations. Fresh Jupyter Notebook contains only one Python kernel. If you are using R language, installed from CRAN, you can easily add it to the Jupyter, following instructions on the page Native R kernel for Jupyter.

But that’s another matter if you installed Microsoft R Open (MRO). In fact, it could be set up separately or as a part of R Tools for Visual Studio (RTVS). Above-mentioned recipe won’t work for MRO.

First, let’s install some packages to the MRO library folder. On my machine, this is C:\Program Files\Microsoft\R Client\R_SERVER\library

Open system menu. Find Microsoft R Client > RGui, and run it as Administrator. Copy the following code to the R Console and run it:

lib.path <- 'C:/Program Files/Microsoft/R Client/R_SERVER/library'
install.packages('devtools', dependencies = TRUE, lib = lib.path)
install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'uuid', 'digest'), lib = lib.path)
devtools::install_github('IRkernel/IRkernel')

MRO on my machine has version 3.3.2, and I would try to install R kernel with the command:

IRkernel::installspec(name = 'ir33', displayname = 'R 3.3.2')

The result will be disappointing:

irkernel-error
In fact, the error message gives us a clue: jupyter.exe cannot be found. Find a folder where the Anaconda is installed. On my machine, it is in the folder C:\Program Files\Anaconda3. Add folder C:\Program Files\Anaconda3\Scripts to the PATH environment variable of your account:

irkernel-edit-env
Close and open RGui and try again:

IRkernel::installspec(name = 'ir33', displayname = 'R 3.3.2')

Success! Now open system menu, and run Anaconda3 > Jupyter Notebook. R kernel should be in the list of notebooks:

irkernel-menu

Now you can remove C:\Program Files\Anaconda3\Scripts from the PATH variable if you don’t need it anymore.

Create a new R notebook and check that it is working now:

irkernel-rnotebook

Added kernels could be found in the AppData\Roaming\jupyter\kernels folder like this: C:\Users\Eugene\AppData\Roaming\jupyter\kernels

If you don’t need a kernel, just remove it from this folder.

My Toolkit for Anomaly Detection

Life is full of surprises. Our goal is to make a distinction between them and “normal” behavior. That is called Anomaly Detection. In fact, anomalies are most interesting things in Data Analysis. And it is always good to have a set of handy tools for that at hand. Here is my toolkit.

AnomalyDetection R package

Twitter’s AnomalyDetection is a popular and simple in use R package for time series anomaly analysis. The package uses a Seasonal Hybrid ESD (Extreme Studentized Deviate test) algorithm to identify local and global anomalies.

As an outcome of its work, we can get a data.frame with anomalous observations, and, if necessary, a plot with both the time series and the estimated anoms, indicated by circles:

sunspont-numbers

Outlier in psych R package

Dealing with multidimensional numeric or logical data, we can detect outliers, calculating Mahalanobis distance for each data point and then compare these to the expected values of Χ2. We can do it with the outlier function of the psych R package:

D2 <- outlier(dat, plot=TRUE, bad=5)

Looking at the Q-Q plot below, we can set a threshold for D2 to identify outliers, let’s say, above 18:

outlier-qq-plot

In other words, any observations, which Mahalanobis distances are above the threshold, can be considered as outliers.

Time Series Anomaly Detection in Azure ML

I like Microsoft Azure Machine Learning Studio. It contains a really powerful module for Time Series Anomaly Detection. It can measure:

  • the magnitude of upward and downward changes
  • direction and duration of trends: positive vs. negative changes

The module learns the pattern from the data, and adds two columns (Anomaly score and Alert indicator) to indicate values that are potentially anomalous:

azure-ml-ts-anomaly

One-Class Support Machine in Azure ML

This Azure ML module can be used when we have a lot of data, labeled as “normal” and not too many anomalous instances. One-class SVM learns a discriminative boundary around normal instances, and everything out of the boundary is considered as anomalous. Our responsibility is to tune model parameters and train it.

Running the experiment does the scoring of the data. The scored output adds two more columns to the dataset: Scored Labels and Score Probabilities. The Score Label is a 1 or a 0, where a 1 is representing an outlier:

azure-ml-svm-anomaly

PCA-Based Anomaly Detection in Azure ML

Like in case of One-class SVM, PCA-Based Anomaly Detection model is trained on normal data. The Scored dataset contains Scored Labels and Score Probabilities. But mind you that for the PCA-based model, the Scored Label 1 means normal data:

azure-ml-pca-anomaly

rxOneClassSvm in R

If we cannot use Cloud-based solutions (and Azure ML respectively) for some reasons, we can use rxOneClassSvm function, included into MicrosoftML R package. MicrosoftML is a package for Microsoft R Server, Microsoft R Client, and SQL Server Machine Learning Services.

The training set contains only examples from the normal class. In order to train a model we have to specify an R formula:

svmModel <- rxOneClassSvm(
   formula = ~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
   data = trainIris)

Scoring results include a variable Score:

scoreDF <- rxPredict(svmModel, 
   data = testIris, extraVarsToWrite = "isIris")
tail(scoreDF)
   isIris      Score
57      1 -0.3131609
58      1 -0.3095322
59      1 -0.1532502
60      1 -0.3937540
61      0  0.5537572
62      0  0.4861979

R documentation asserts:

“This algorithm will not attempt to load the entire dataset into memory.”

Hmm, quite a useful feature indeed!

What else?

In fact, there are much more packages for anomaly detection. We can use any binary or multi-class classifiers, cluster analysis, neural networks, kNN and many others. But this is my First Aid Kit.

R pairs chart in Power BI

As a rule, we are using Power BI to present our findings, creating dashboards or reports. But Microsoft Power BI can be useful on the stage of initial exploratory data analysis as well.

I found it when I needed to examine a really wide data table, containing hundreds of columns. Usually, I am writing an R script, creating Scatterplot matrices using pairs(). But having a lot of features, and wishing to browse them in different combinations, that would be a bit onerously.

That is why I created a ggpairs R Visual, showing the same chart in Power BI. There are two reasons for that. First, I can quickly select features to display, simply marking them on the “Fields” pane in Power BI. Secondly, Power BI has a lot of Data Sources which could be accessed much easy than in R.

r-pairs

Of course, Power BI has a few drawbacks. It is trying to refresh a chart every time you are selecting/deselecting fields. It is annoying. And do not forget about the data size limitation in R Visuals – Power BI takes no more than first 150,000 rows.

The source code of the ggpairs.R can be found there download