How to get pooled mean by group after multiple imputation?

Assume we use the following example from the mice package:

imp <- mice(nhanes, m=5, print=FALSE, seed=55152)

I want to obtain the means of nhanes$chl grouped by nhanes$hyp. Without using multiple imputation, one would do:

library(mice)
library(dplyr)

nhanes %>% 
  group_by(hyp) %>% 
  summarise(chl=mean(chl, na.rm=TRUE))

How can I obtain such pooled estimates of means of chl per group of hyp after multiple imputation with mice?

Solution

Without imputation, only rows with complete cases for chl and hyp are used, referred to as list-wise deletion. Using this method, we get these means by group:

> tapply(X=nhanes$chl, INDEX=nhanes$hyp, FUN=mean, na.rm=TRUE)
       1        2 
181.8182 229.0000

Multiple imputation,

> m. <- 5
> imp <- mice::mice(nhanes, m=m., print=FALSE, seed=55152)

gives us m imputed datasets, that we usually combine in a long format. (Note, that m > 20 is recommended.¹)

> long <- mice::complete(imp, action='long')

> with(long, table(.imp))  ## nrows by .imp
.imp
 1  2  3  4  5 
25 25 25 25 25 

> nrow(nhanes)  ## nrows original
[1] 25

Each imputed dataset will have it's own statistics (i.e. group means in this case). For example .imp == 1 will have:

> tapply(long[long$.imp == 1, ]$chl, long[long$.imp == 1, ]$hyp, mean, na.rm=TRUE)
       1        2 
185.8421 212.5000

According to Rubin's rules, what we want is the mean of these imputed statistics to get the estimate. Accordingly we calculate the statistics for each imputation,

> (r0 <- do.call('rbind', by(long, ~ .imp, \(x) tapply(x$chl, x$hyp, mean))))
         1        2
1 185.8421 212.5000
2 186.9000 231.0000
3 187.9000 237.8000
4 189.0000 232.5000
5 188.1053 215.8333

and take the means of these results.

> colMeans(r0)
       1        2 
187.5495 225.9267

We might report the estimates with standard errors, which we compute accordingly:

> ## compute mean and variance for each imputation
> R1 <- by(long, ~ .imp, \(x)
+          sapply(c(mean=mean, var=\(x) var(x)/length(x)), \(f) 
+                 tapply(x$chl, x$hyp, f))) |> 
+   simplify2array()
> 
> Q <- rowMeans(R1[, 1, ])  ## calculate mean estimates
> U <- rowMeans(R1[, 2, ])  ## calculate within variances
> 
> B <- rowSums(((R1[, 1, ] - Q)^2))/(m. - 1)  ## calculate between variances 
> 
> T <- U + (1 + 1/m.)*B  ## calculate total variances 
> 
> cbind(Estimate=Q, 'Std. Error'=sqrt(T))
  Estimate Std. Error
1 187.5495   9.175414
2 225.9267  21.404731

Data:

> nhanes <- mice::nhanes