Tags: r, missing-data

R function to determine if missing data is related


I have a dataset with quite a bit of missing data in some columns (~20%), and I'm trying to figure out what proportion of the missing values occur in the same patients (e.g., are the 20% of patients missing heart rate the same 20% that are missing systolic blood pressure?). The main purpose is to determine whether data are more likely to be missing in patients with particular outcomes. I've tried the varclus function (from the Hmisc package) in R, but I haven't had any luck. Any suggestions or guidance would be greatly appreciated, thank you! :)


Solution

  • Here's a tidyverse workflow to visualize missingness across your dataset:

    library(dplyr)
    library(tidyr)
    library(ggplot2)
    
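    starwars %>% 
      # replace every value with TRUE (missing) or FALSE (present)
      mutate(across(everything(), is.na)) %>% 
      # sort rows so identical missingness patterns sit next to each other
      arrange(across(everything())) %>% 
      mutate(row = row_number()) %>% 
      # one row per (row, column) cell for plotting
      pivot_longer(!row, names_to = "column", values_to = "missing") %>% 
      ggplot() +
      geom_tile(aes(row, column, fill = missing))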
    

    For starters, it looks like the same rows tend to be missing species, sex, and gender. To confirm, we can do:

    starwars %>% 
      count(across(c(species, sex, gender), is.na))
    
    #> # A tibble: 2 × 4
    #>   species sex   gender     n
    #>   <lgl>   <lgl> <lgl>  <int>
    #> 1 FALSE   FALSE FALSE     83
    #> 2 TRUE    TRUE  TRUE       4
    

    Created on 2022-10-24 with reprex v2.0.2

    This confirms that whenever any one of species, sex, or gender is missing, the other two are missing as well.
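
    To put a number on the overlap for every pair of columns at once, here is a minimal sketch (not part of the reprex above). It computes, for each pair, the share of rows missing one column that are also missing the other. It reuses starwars as a stand-in for your patient data; the names miss, co_n, and co_prop are only illustrative.

    ```r
    library(dplyr)  # only needed here for the starwars example data

    # 0/1 indicator matrix: 1 where a value is missing, 0 otherwise
    miss <- +is.na(starwars)

    # keep only the columns that have at least one missing value
    miss <- miss[, colSums(miss) > 0, drop = FALSE]

    # co_n[i, j] = number of rows where column i and column j are both missing
    co_n <- crossprod(miss)

    # co_prop[i, j] = share of the rows missing column i that also miss column j
    # (the vector diag(co_n) is recycled down the rows, so row i is divided
    # by the number of rows missing column i)
    co_prop <- co_n / diag(co_n)

    round(co_prop, 2)
    ```

    Here co_prop["species", "sex"] is 1, which matches the count above. The matrix is not symmetric in general: a row reads as "of the rows missing this column, what fraction also miss each other column".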

    PS - the mice package has more tools for exploring missing data.
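
    As one example of those tools, mice::md.pattern() tabulates every distinct missingness pattern in a data frame. A short sketch, assuming mice is installed and again using a few starwars columns (chosen arbitrarily) as stand-in data:

    ```r
    library(dplyr)
    library(mice)

    # a handful of columns, converted to a plain data frame
    sw <- starwars %>%
      select(height, mass, species, sex, gender) %>%
      as.data.frame()

    # one row per distinct missingness pattern (1 = observed, 0 = missing);
    # row names give how many rows show each pattern, the last column counts
    # the missing variables in that pattern, and the last row counts the
    # missing values per variable
    pat <- md.pattern(sw, plot = FALSE)
    pat
    ```

    Setting plot = TRUE (the default) draws the same table as a grid, which can be easier to read when there are many patterns.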