Tags: r, dataframe, statistics, data-analysis

Dealing with Missing Values for one Variable in R


I'm working with a data set in which values are missing for only one variable. I want to determine whether they are missing at random, in which case I could simply remove those rows from the data frame. To do that, I am looking for potential correlations between the NAs in that variable and the values of the other variables. I found the following code online:

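library("VIM")
data(sleep, package = "VIM")           # VIM's mammal sleep data (contains NAs), not datasets::sleep
x <- as.data.frame(abs(is.na(sleep)))  # indicator matrix: 1 = missing, 0 = observed
head(sleep)
head(x)
y <- x[which(sapply(x, sd) > 0)]       # keep only the variables that have at least one NA
cor(y)                                 # correlations between the missingness indicators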

However, this only shows how the missing values of different variables are correlated with each other, which is useful when they are distributed across several variables.

Is there a way to find the correlation between the missing values of one variable and the observed values of another variable, rather than the correlation between the missing values themselves? For example, if a survey has an optional question about family income, how could you determine in R whether the missing answers are correlated with, e.g., low income?


Solution

    library(finalfit)
    library(dplyr)
    
    df <- data.frame(
      A = c(1,2,4,5),
      B = c(55,44,3,6),
      C = c(NA, 4, NA, 5)
    )
    
    df %>%
      missing_pairs("A", "C")
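
If you would rather have numbers than a plot, the same idea can be sketched in base R: build a 0/1 indicator for the NAs in the one affected variable and relate it to the observed values of the other variables. The `survey` data frame and its column names below are made up for illustration and are not from the question. Note that missingness can only be compared against variables that are actually observed, so whether NAs go along with low income itself can at best be checked through proxies such as education.

```r
# Hypothetical survey data (illustrative values): only income has NAs.
survey <- data.frame(
  age       = c(23, 35, 41, 52, 29, 60, 47, 33),
  education = c(10, 16, 12, 18, 11, 17, 12, 16),  # years of schooling
  income    = c(NA, 52000, NA, 78000, NA, 69000, 41000, 58000)
)

# 1 where income is missing, 0 where it is observed.
income_missing <- as.integer(is.na(survey$income))

# Correlation of the missingness indicator with each fully observed variable.
other_vars <- survey[, c("age", "education")]
miss_cor <- cor(other_vars, income_missing)
print(miss_cor)

# Mean of each variable for respondents with and without a reported income.
group_means <- aggregate(other_vars,
                         by = list(income_missing = income_missing),
                         FUN = mean)
print(group_means)

# Welch t-test: does education differ between the two groups?
edu_test <- t.test(survey$education[income_missing == 1],
                   survey$education[income_missing == 0])
print(edu_test$p.value)
```

A clearly non-zero correlation or a large difference in group means suggests that the NAs depend on the other variables, in which case simply dropping those rows may bias the analysis. With more data, a logistic regression of the indicator on the other variables would be a natural next step.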
    
