Visualize missing values in R

Data quality matters for analysis, modeling, and prediction. The Microsoft data science utilities for the Team Data Science Process include a script, IDEAR, that can visualize missing values. You specify the number of segments to split your data into; the script calculates the average missing-value rate for each segment and displays the result with levelplot.
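To make the segment idea concrete, here is a minimal base R sketch of the calculation. It is not the IDEAR code: the function name segment_missing_rates() and its arguments are hypothetical, and it assumes the segments are contiguous blocks of rows of roughly equal size.

```r
# Hypothetical helper, not taken from IDEAR or MissingValues.R.
# Returns a matrix of missing-value rates: one row per segment,
# one column per variable.
segment_missing_rates <- function(data, n_segments = 10) {
  stopifnot(is.data.frame(data), nrow(data) > 0, n_segments >= 1)
  n_segments <- min(n_segments, nrow(data))

  # Assign each row to one of n_segments contiguous blocks
  segment <- ceiling(seq_len(nrow(data)) * n_segments / nrow(data))

  # Logical matrix: TRUE where a value is missing
  na_flags <- is.na(data)

  # Count NAs per segment and variable, then divide by segment size
  na_counts <- rowsum(na_flags + 0, segment)
  na_counts / as.vector(table(segment))
}

# Small example: 4 rows split into 2 segments of 2 rows each
df <- data.frame(a = c(NA, 1, 2, NA), b = c(1, NA, NA, NA))
rates <- segment_missing_rates(df, n_segments = 2)
print(rates)
```

Each cell of the result is the share of missing values for one variable within one block of rows, which is the kind of quantity a heatmap can then display.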

I refactored this function a bit and replaced levelplot with ggplot. My plot_missing() function requires two packages: ggplot2 and reshape. The code of the MissingValues.R script is available on GitHub.
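For readers who have not drawn a heatmap with ggplot2 before, the general pattern looks roughly like the sketch below. This is only an illustration of the technique, not the contents of MissingValues.R: the rates matrix is made-up example data, and the reshaping uses base R's as.table() in place of the reshape package.

```r
library(ggplot2)

# Made-up example input: missing-value rates,
# one row per segment, one column per variable
rates <- matrix(c(0.0, 0.2, 0.5,
                  0.1, 0.0, 0.4),
                nrow = 3,
                dimnames = list(segment = c("1", "2", "3"),
                                variable = c("x", "y")))

# Reshape the wide matrix into long form: one row per (segment, variable) pair
long <- as.data.frame(as.table(rates), responseName = "rate")

# One tile per pair, filled according to the missing-value rate
p <- ggplot(long, aes(x = segment, y = variable, fill = rate)) +
  geom_tile(colour = "white") +
  scale_fill_gradientn(colours = c("white", "steelblue"), limits = c(0, 1)) +
  labs(x = "Segment", y = "Variable", fill = "Missing rate")

print(p)
```

The colours argument of scale_fill_gradientn() accepts any vector of colors, so a palette vector can be swapped in there.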

If you pass a data.frame with some missing values (NA) to the function, you get a plot showing how densely the missing values are distributed across the data:

plot_missing(data)

missing-plot1

The leftmost column, “All”, shows the average missing-value rate of each variable over the whole data set.
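The whole-data-set rates can also be checked numerically, independently of the plot. The snippet below is a small sketch that assumes the built-in airquality data set as a stand-in for data; it has missing values in the Ozone and Solar.R columns.

```r
# Built-in example data set with NAs in Ozone and Solar.R
data <- airquality

# Share of missing values per variable over all rows
overall_rate <- colMeans(is.na(data))
print(round(overall_rate, 3))
```

Variables with a rate of 0 have no missing values at all.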

You can change the number of segments and the color palette. For instance, you can use a palette from the RColorBrewer package:

# Install RColorBrewer only if it is not already available
if (!requireNamespace("RColorBrewer", quietly = TRUE))
   install.packages("RColorBrewer")

library(RColorBrewer)
plot_missing(data, 5, col = brewer.pal(n = 9, name = "Blues"))

missing-plot2

If you do not remember palette names, you can display them:

display.brewer.all()

Here is another nice palette:

plot_missing(data, col = brewer.pal(n = 9, name = "YlOrRd"))

missing-plot3

The source code: GitHub download
