I recently started using the package mice for the first time and have a series of summed scores on different measures that I needed to passively impute.
I followed the instructions in the relevant vignette as closely as I could, but no matter what I did, my summed columns imputed through mice did not add up to the expected value.
After hours of experimenting, I've realised that the passive imputation only works when the summed column is positioned after the columns included in the sum.
I'll use nhanes to illustrate.
library(tidyverse)
library(mice)
> head(nhanes)
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184
Now, to make it simple to check whether it has worked, let's create a column called sum by adding hyp + chl:
nhanes_sum <- nhanes %>%
  mutate(sum = hyp + chl)
> head(nhanes_sum)
  age  bmi hyp chl sum
1   1   NA  NA  NA  NA
2   2 22.7   1 187 188
3   1   NA   1 187 188
4   3   NA  NA  NA  NA
5   1 20.4   1 113 114
6   3   NA  NA 184  NA
Next, I want to use mice to impute the missing values of hyp and chl, then create the sum.
# dry run (maxit = 0) to obtain the default method vector and predictor matrix
imp0 <- mice(nhanes_sum, maxit = 0)
meth <- imp0$method
pred <- imp0$predictorMatrix
# set the method for sum
meth["sum"] <- "~I(hyp+chl)"
> meth
          age           bmi           hyp           chl           sum
           ""         "pmm"         "pmm"         "pmm" "~I(hyp+chl)"
# use hyp and chl to impute sum
pred["sum", c("hyp", "chl")] <- 1
> pred
    age bmi hyp chl sum
age   0   1   1   1   0
bmi   1   0   1   1   0
hyp   1   1   0   1   0
chl   1   1   1   0   0
sum   0   0   1   1   0
# run imputation with 1 iteration
imp <- mice(nhanes_sum, maxit = 1, meth = meth, pred = pred, seed = 2)
> head(complete(imp))
  age  bmi hyp chl sum
1   1 27.2   1 238 239
2   2 22.7   1 187 188
3   1 22.0   1 187 188
4   3 21.7   1 186 187
5   1 20.4   1 113 114
6   3 25.5   2 184 186
We can see that this has worked as expected. For example, sum in row 1 is equal to hyp + chl (1 + 238 = 239), even though it was NA before.
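To confirm this beyond the first six rows, one option is to test the identity in every completed dataset rather than reading it off head(). The sketch below is only an illustration: it rebuilds the example with base R's transform() instead of mutate() so that it runs without the tidyverse, it spells out the full argument names method and predictorMatrix, and the names long and consistent are my own.

```r
library(mice)

# Rebuild the example data (base R stand-in for the mutate() call)
nhanes_sum <- transform(nhanes, sum = hyp + chl)

# Same setup as above: dry run, then make sum a passive variable
imp0 <- mice(nhanes_sum, maxit = 0)
meth <- imp0$method
pred <- imp0$predictorMatrix
meth["sum"] <- "~I(hyp+chl)"
pred["sum", c("hyp", "chl")] <- 1

imp <- mice(nhanes_sum, maxit = 1, method = meth, predictorMatrix = pred,
            seed = 2, printFlag = FALSE)

# Stack all completed datasets and check sum == hyp + chl in every row
long <- complete(imp, action = "long")
consistent <- all(long$sum == long$hyp + long$chl)
consistent
```

Because sum is the last column here, it is recomputed after hyp and chl in each iteration, so the check should come back TRUE for this ordering.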
But what happens if we put the sum at the beginning of the dataframe?
nhanes_sum2 <- nhanes_sum %>%
  select(sum, everything())
> head(nhanes_sum2)
  sum age  bmi hyp chl
1  NA   1   NA  NA  NA
2 188   2 22.7   1 187
3 188   1   NA   1 187
4  NA   3   NA  NA  NA
5 114   1 20.4   1 113
6  NA   3   NA  NA 184
# repeat same process as above:
imp0.2 <- mice(nhanes_sum2, maxit = 0)
meth2 <- imp0.2$method
pred2 <- imp0.2$predictorMatrix
meth2["sum"] <- "~I(hyp+chl)"
> meth2
          sum           age           bmi           hyp           chl
"~I(hyp+chl)"            ""         "pmm"         "pmm"         "pmm"
pred2["sum", c("hyp", "chl")] <- 1
> pred2
    sum age bmi hyp chl
sum   0   0   0   1   1
age   0   0   1   1   1
bmi   0   1   0   1   1
hyp   0   1   1   0   1
chl   0   1   1   1   0
imp2 <- mice(nhanes_sum2, maxit = 1, meth = meth2, pred = pred2, seed = 2)
# check result
> head(complete(imp2))
  sum age  bmi hyp chl
1 230   1 27.2   1 131
2 188   2 22.7   1 187
3 188   1 20.4   1 187
4 189   3 20.4   1 184
5 114   1 20.4   1 113
6 185   3 22.7   1 184
Now in row 1 (which had been NA), sum = 230 even though hyp = 1 and chl = 131.
Why does this happen?
As mentioned in the mice() documentation:
"Though not strictly needed, it is often useful to specify visitSequence such that the column that is imputed by the ~ mechanism is visited each time after one of its predictors was visited. In that way, deterministic relation between columns will always be synchronized."
The problem you've found is that after imputing sum via passive imputation, either the hyp or chl variable (or both) is imputed with a new value. This replaces the previous value of hyp or chl, and requires sum to be recalculated. mice() will automatically detect the correct order to impute values in some situations, and monotone imputation sometimes works. But in this case we need to help it out.
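The ordering effect can be reproduced without mice at all. The toy below is not how mice is implemented; run_iteration() is a hypothetical helper that walks a visit sequence once, pretends that hyp and chl receive fresh draws, and recomputes sum only when sum is visited. The starting value chl = 229 is made up so that the stale sum matches the 230 seen in row 1 above.

```r
# Toy stand-in for one pass over a visit sequence (illustration only)
run_iteration <- function(state, visit) {
  for (v in visit) {
    state[[v]] <- switch(v,
      hyp = 1,                       # pretend a new value was drawn for hyp
      chl = 131,                     # pretend a new value was drawn for chl
      sum = state$hyp + state$chl    # passive step: uses the current hyp and chl
    )
  }
  state
}

# Hypothetical starting state for a row whose three values were all missing
start <- list(sum = NA, hyp = 1, chl = 229)

stale  <- run_iteration(start, c("sum", "hyp", "chl"))  # sum visited first
synced <- run_iteration(start, c("hyp", "chl", "sum"))  # sum visited last

stale_ok  <- stale$sum == stale$hyp + stale$chl     # FALSE: 230 vs 1 + 131
synced_ok <- synced$sum == synced$hyp + synced$chl  # TRUE: 132 vs 1 + 131
c(stale = stale_ok, synced = synced_ok)
```

When sum is visited first it is built from the old chl and then left behind once chl changes, which is the pattern in imp2.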
Checking imp2$visitSequence will tell you the order in which the imputations were created.
imp2$visitSequence
# [1] "sum" "age" "bmi" "hyp" "chl"
Setting a new visitSequence where sum comes after hyp and chl solves the issue. Note that we also set maxit=2 to avoid a warning message.
visit <- imp2$visitSequence
visit2 <- c(visit, "sum")
imp3 <- mice(nhanes_sum2, maxit = 2, meth = meth2, pred = pred2,
             visitSequence = visit2, seed = 2)
head(complete(imp3))
#   sum age  bmi hyp chl
# 1 200   1 29.6   1 199
# 2 188   2 22.7   1 187
# 3 188   1 28.7   1 187
# 4 205   3 20.4   1 204
# 5 114   1 20.4   1 113
# 6 186   3 20.4   2 184
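If there are several passive columns, typing the sequence by hand gets tedious. A possible generalisation, offered as a sketch rather than a recommended recipe, is to find every variable whose method starts with "~" and move it to the end of the default visit sequence, then verify the result across all completed datasets. The names passive, long and consistent are illustrative, and transform() is used in place of the dplyr calls so the block runs on its own.

```r
library(mice)

# Rebuild the data with sum as the first column
nhanes_sum2 <- transform(nhanes, sum = hyp + chl)[, c("sum", "age", "bmi", "hyp", "chl")]

imp0.2 <- mice(nhanes_sum2, maxit = 0)
meth2 <- imp0.2$method
pred2 <- imp0.2$predictorMatrix
meth2["sum"] <- "~I(hyp+chl)"
pred2["sum", c("hyp", "chl")] <- 1

# Passive variables are the ones whose method is a "~" formula;
# visit them after everything else
passive <- names(meth2)[startsWith(meth2, "~")]
visit2 <- c(setdiff(imp0.2$visitSequence, passive), passive)

imp3 <- mice(nhanes_sum2, maxit = 2, method = meth2, predictorMatrix = pred2,
             visitSequence = visit2, seed = 2, printFlag = FALSE)

# Check sum == hyp + chl in every row of every completed dataset
long <- complete(imp3, action = "long")
consistent <- all(long$sum == long$hyp + long$chl)
consistent
```

Unlike c(visit, "sum"), this visits sum once per iteration instead of twice, so the imputed values may differ from those shown above even with the same seed. Simply appending all passive variables is enough here because sum depends only on imputed columns; if one passive variable fed into another, their relative order would also matter.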