Search code examples
rstatisticsr-mice

How to get pooled mean by group after multiple imputation?


Assume we use the following example from the mice package:

imp <- mice(nhanes, m=5, print=FALSE, seed=55152)

I want to obtain the means of nhanes$chl grouped by nhanes$hyp. Without using multiple imputation, one would do:

library(mice)
library(dplyr)

nhanes %>% 
  group_by(hyp) %>% 
  summarise(chl=mean(chl, na.rm=TRUE))

How can I obtain such pooled estimates of means of chl per group of hyp after multiple imputation with mice?


Solution

  • Without imputation, only rows with complete cases for chl and hyp are used, referred to as list-wise deletion. Using this method, we get these means by group:

    > tapply(X=nhanes$chl, INDEX=nhanes$hyp, FUN=mean, na.rm=TRUE)
           1        2 
    181.8182 229.0000 
    

    Multiple imputation,

    > m. <- 5
    > imp <- mice::mice(nhanes, m=m., print=FALSE, seed=55152)
    

    gives us m imputed datasets, that we usually combine in a long format. (Note, that m > 20 is recommended.1)

    > long <- mice::complete(imp, action='long')
    
    > with(long, table(.imp))  ## nrows by .imp
    .imp
     1  2  3  4  5 
    25 25 25 25 25 
    
    > nrow(nhanes)  ## nrows original
    [1] 25
    

    Each imputed dataset will have it's own statistics (i.e. group means in this case). For example .imp == 1 will have:

    > tapply(long[long$.imp == 1, ]$chl, long[long$.imp == 1, ]$hyp, mean, na.rm=TRUE)
           1        2 
    185.8421 212.5000 
    

    According to Rubin's rules, what we want is the mean of these imputed statistics to get the estimate. Accordingly we calculate the statistics for each imputation,

    > (r0 <- do.call('rbind', by(long, ~ .imp, \(x) tapply(x$chl, x$hyp, mean))))
             1        2
    1 185.8421 212.5000
    2 186.9000 231.0000
    3 187.9000 237.8000
    4 189.0000 232.5000
    5 188.1053 215.8333
    

    and take the means of these results.

    > colMeans(r0)
           1        2 
    187.5495 225.9267 
    

    We might report the estimates with standard errors, which we compute accordingly:

    > ## compute mean and variance for each imputation
    > R1 <- by(long, ~ .imp, \(x)
    +          sapply(c(mean=mean, var=\(x) var(x)/length(x)), \(f) 
    +                 tapply(x$chl, x$hyp, f))) |> 
    +   simplify2array()
    > 
    > Q <- rowMeans(R1[, 1, ])  ## calculate mean estimates
    > U <- rowMeans(R1[, 2, ])  ## calculate within variances
    > 
    > B <- rowSums(((R1[, 1, ] - Q)^2))/(m. - 1)  ## calculate between variances 
    > 
    > T <- U + (1 + 1/m.)*B  ## calculate total variances 
    > 
    > cbind(Estimate=Q, 'Std. Error'=sqrt(T))
      Estimate Std. Error
    1 187.5495   9.175414
    2 225.9267  21.404731
    

    Data:

    > nhanes <- mice::nhanes