Data quality is extremely important for analysis, modeling, and prediction. The Microsoft data science utilities for the Team Data Science Process include a script, IDEAR, that can visualize missing values. You specify the number of segments to split your data into; the script then calculates the average missing value rate for each segment and visualizes it with a levelplot.
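To make the segment idea concrete, here is a minimal base R sketch of the calculation. It is not the IDEAR code or my plot_missing() function; the helper name missing_rate_by_segment and its arguments are made up for illustration, and it assumes the segments are contiguous blocks of rows of roughly equal size.

```r
# Hypothetical helper (not part of IDEAR or MissingValues.R): share of NA
# values per column, for the whole data set and for each row segment.
missing_rate_by_segment <- function(df, n_segments = 10) {
  stopifnot(is.data.frame(df), nrow(df) > 0, n_segments >= 1)

  # Assign each row to one of n_segments contiguous, roughly equal blocks
  segment <- ceiling(seq_len(nrow(df)) * n_segments / nrow(df))

  # Logical matrix (rows x columns): TRUE where a value is missing
  na_flags <- is.na(df)

  # NA count per segment and column, divided by the segment size
  counts <- rowsum(na_flags + 0, segment)
  rates <- counts / as.vector(table(segment))

  # First row "All" holds the overall rate per column
  rbind(All = colMeans(na_flags), rates)
}

# Small made-up example: 6 rows split into 3 segments of 2 rows each
df <- data.frame(a = c(NA, 1, 2, 3, NA, NA),
                 b = c(1, 2, 3, 4, 5, NA))
print(missing_rate_by_segment(df, n_segments = 3))
```

The result is a matrix with one row per segment plus an "All" row, which is the shape a heat map needs.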
I refactored this function a bit and replaced the levelplot with a ggplot2 plot. My plot_missing() function requires two packages: ggplot2 and reshape. The code of the MissingValues.R script can be found on GitHub.
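For a rough idea of how a matrix of missing value rates can be drawn with ggplot2, here is a small sketch. It is not the code from MissingValues.R: the rates matrix is made-up data, the colors are arbitrary, and base R's as.table() is used for reshaping instead of the reshape package.

```r
library(ggplot2)

# Made-up input: rows are segments (plus "All"), columns are variables,
# values are missing value rates between 0 and 1.
rates <- matrix(
  c(0.50, 0.17,
    0.50, 0.00,
    0.00, 0.00,
    1.00, 0.50),
  ncol = 2, byrow = TRUE,
  dimnames = list(segment = c("All", "1", "2", "3"),
                  variable = c("a", "b"))
)

# Long format: one row per (segment, variable) pair
long <- as.data.frame(as.table(rates), responseName = "rate")

# One tile per pair, filled according to the missing value rate
p <- ggplot(long, aes(x = segment, y = variable, fill = rate)) +
  geom_tile(colour = "white") +
  scale_fill_gradientn(colours = c("white", "steelblue"), limits = c(0, 1)) +
  labs(x = "Segment", y = "Variable", fill = "Missing rate")
print(p)
```

Because the dimnames become factor levels in their original order, "All" stays in the leftmost position on the x axis.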
If you pass a data.frame with some missing values (NA) to the function, you will get a plot showing how densely the missing values are distributed across the data:
The leftmost column, “All”, shows the average missing value rate of each variable over the whole data set.
You can change the number of segments and the color palette. For instance, you can use a palette from the RColorBrewer package:
if (!require(RColorBrewer)) install.packages("RColorBrewer")
library(RColorBrewer)
plot_missing(data, 5, col = brewer.pal(n = 9, name = "Blues"))
If you do not remember palette names, you can display them:
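One way to do this, assuming the RColorBrewer package is available, is sketched below. Both brewer.pal.info and display.brewer.all() come from RColorBrewer; filtering for sequential palettes is my own suggestion, since they map naturally onto a rate that runs from 0 to 1.

```r
if (!require(RColorBrewer)) install.packages("RColorBrewer")
library(RColorBrewer)

# Table of all palettes: maximum number of colors, category, colorblind flag
print(brewer.pal.info)

# Names of the sequential palettes only
sequential <- rownames(brewer.pal.info)[brewer.pal.info$category == "seq"]
print(sequential)

# Draw every palette as a strip of colors
display.brewer.all()
```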
Here is another nice palette:
plot_missing(data, col = brewer.pal(n = 9, name = "YlOrRd"))
The source code: GitHub