I have a dataset with quite a bit of missing data in some columns (~20%) and am trying to figure out what proportion of these missing values occur in the same patients (e.g., are the 20% of patients missing heart rate the same 20% that are missing systolic blood pressure?). The main purpose is to determine whether data are more likely to be missing in patients with particular outcomes. I've tried the varclus function (from the Hmisc package) in R, but I haven't had any luck. Any suggestions and guidance are greatly appreciated, thank you! :)
Here's a tidyverse workflow to visualize missingness across your dataset:
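    library(dplyr)
    library(tidyr)
    library(ggplot2)

    starwars %>%
      # replace every value with TRUE (missing) or FALSE (present)
      mutate(across(everything(), is.na)) %>%
      # sort so rows with the same missingness pattern sit together
      arrange(across(everything())) %>%
      mutate(row = row_number()) %>%
      # one line per (row, column) cell for plotting
      pivot_longer(!row, names_to = "column", values_to = "missing") %>%
      ggplot() +
      geom_tile(aes(row, column, fill = missing))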
For starters, it looks like the same rows tend to be missing `species`, `sex`, and `gender`. To confirm, we can do:
    starwars %>%
      count(across(c(species, sex, gender), is.na))
    #> # A tibble: 2 × 4
    #>   species sex   gender     n
    #>   <lgl>   <lgl> <lgl>  <int>
    #> 1 FALSE   FALSE FALSE     83
    #> 2 TRUE    TRUE  TRUE       4
Created on 2022-10-24 with reprex v2.0.2
This confirms that whenever any one of `species`, `sex`, or `gender` is missing, the other two are missing as well.
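To check every pair of columns at once rather than one hand-picked group, something like the following might work. It is a rough sketch, not part of the workflow above: for each pair it computes the share of rows missing one column that are also missing the other. The choice of columns and the names `vars`, `miss`, `both`, and `overlap` are just for illustration.

```r
library(dplyr)

# Pick the columns to compare (an arbitrary selection for illustration)
vars <- select(starwars, height, mass, birth_year, species, sex, gender)

# Numeric indicator matrix: 1 where a value is missing, 0 otherwise
miss <- 1 * is.na(vars)

# both[i, j] = number of rows where columns i and j are BOTH missing;
# the diagonal holds the number of missing values in each column
both <- crossprod(miss)

# overlap[i, j] = share of the rows missing column i that also miss column j
# (a column with no missing values at all would give NaN in its row)
overlap <- both / diag(both)

round(overlap, 2)
```

Read row-wise: a value of 1 in row `species`, column `sex` would mean every row missing `species` is also missing `sex`. The matrix is not symmetric in general, because each row is scaled by that column's own missing count.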
PS - the mice package has more tools for exploring missing data.
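As a rough sketch of how this could look for the patient data in the question: `md.pattern()` from mice tabulates the distinct missingness patterns, and a grouped summary gives the missing rate per outcome. The `patients` data frame and its columns (`outcome`, `heart_rate`, `systolic_bp`) are invented here purely for illustration, so substitute your own.

```r
library(dplyr)

# Hypothetical toy data standing in for the real patient table
patients <- data.frame(
  outcome     = c("died", "died", "died", "survived", "survived", "survived"),
  heart_rate  = c(NA, NA, 88, 72, 65, NA),
  systolic_bp = c(NA, NA, 120, 110, NA, 130)
)

# Share of missing values per column within each outcome group
miss_by_outcome <- patients %>%
  group_by(outcome) %>%
  summarise(across(everything(), ~ mean(is.na(.x))), n = n())

print(miss_by_outcome)

# Table of missingness patterns, only if mice is installed
if (requireNamespace("mice", quietly = TRUE)) {
  patterns <- mice::md.pattern(select(patients, -outcome), plot = FALSE)
  print(patterns)
}
```

If the missing rates differ clearly between outcome groups, that suggests missingness is related to the outcome, though a formal test or model would be needed to say more than that.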