MICE in R: Why is passive imputation result influenced by column position?


I recently started using the package mice for the first time and have a series of summed scores on different measures that I needed to passively impute.

I followed the instructions in the relevant vignette as closely as I could, but no matter what I did, my summed columns imputed through mice did not add up to the expected value.

After hours of experimenting, I've realised that the passive imputation only works when the summed column is positioned after the columns included in the sum.

I'll use nhanes to illustrate.

library(tidyverse)
library(mice)

> head(nhanes)
  age  bmi hyp chl
1   1   NA  NA  NA
2   2 22.7   1 187
3   1   NA   1 187
4   3   NA  NA  NA
5   1 20.4   1 113
6   3   NA  NA 184

To make it easy to check whether it has worked, let's create a column called sum by adding hyp + chl:

nhanes_sum <- nhanes %>%
  mutate(sum = hyp+chl)

> head(nhanes_sum)
  age  bmi hyp chl sum
1   1   NA  NA  NA  NA
2   2 22.7   1 187 188
3   1   NA   1 187 188
4   3   NA  NA  NA  NA
5   1 20.4   1 113 114
6   3   NA  NA 184  NA

Next, I want to use mice to impute the missing values of hyp and chl, then create the sum.

imp0 <- mice(nhanes_sum, maxit = 0)

meth <- imp0$method
pred <- imp0$predictorMatrix

# set the method for sum
meth["sum"] <- "~I(hyp+chl)"
> meth
          age           bmi           hyp           chl           sum 
           ""         "pmm"         "pmm"         "pmm" "~I(hyp+chl)" 

# use hyp and chl to impute sum
pred["sum", c("hyp", "chl")] <- 1

> pred
    age bmi hyp chl sum
age   0   1   1   1   0
bmi   1   0   1   1   0
hyp   1   1   0   1   0
chl   1   1   1   0   0
sum   0   0   1   1   0

# run imputation with 1 iteration
imp <- mice(nhanes_sum, maxit = 1, meth = meth, pred = pred, seed = 2)

> head(complete(imp))
  age  bmi hyp chl sum
1   1 27.2   1 238 239
2   2 22.7   1 187 188
3   1 22.0   1 187 188
4   3 21.7   1 186 187
5   1 20.4   1 113 114
6   3 25.5   2 184 186

We can see that this has worked as expected: for example, sum in row 1 is equal to hyp + chl even though it was NA before.

But what happens if we put the sum at the beginning of the dataframe?

nhanes_sum2 <- nhanes_sum %>%
  select(sum, everything())

> head(nhanes_sum2)
  sum age  bmi hyp chl
1  NA   1   NA  NA  NA
2 188   2 22.7   1 187
3 188   1   NA   1 187
4  NA   3   NA  NA  NA
5 114   1 20.4   1 113
6  NA   3   NA  NA 184

# repeat same process as above:

imp0.2 <- mice(nhanes_sum2, maxit = 0)

meth2 <- imp0.2$method
pred2 <- imp0.2$predictorMatrix

meth2["sum"] <- "~I(hyp+chl)"
> meth2
          sum           age           bmi           hyp           chl 
"~I(hyp+chl)"            ""         "pmm"         "pmm"         "pmm" 

pred2["sum", c("hyp", "chl")] <- 1
> pred2
    sum age bmi hyp chl
sum   0   0   0   1   1
age   0   0   1   1   1
bmi   0   1   0   1   1
hyp   0   1   1   0   1
chl   0   1   1   1   0

imp2 <- mice(nhanes_sum2, maxit = 1, meth = meth2, pred = pred2, seed = 2)

# check result
> head(complete(imp2))
  sum age  bmi hyp chl
1 230   1 27.2   1 131
2 188   2 22.7   1 187
3 188   1 20.4   1 187
4 189   3 20.4   1 184
5 114   1 20.4   1 113
6 185   3 22.7   1 184

Now in row 1 (which had been NA), sum = 230 even though hyp = 1 and chl = 131.
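
To see how many rows are affected rather than inspecting only the first six, a check along these lines could be used. This is only a sketch: it rebuilds the example in base R so that it runs on its own, and the names dat, ini, imp_first, done and mismatch are illustrative rather than part of the code above.

```r
library(mice)

# Rebuild the reordered data in base R: sum = hyp + chl, placed first
dat <- nhanes
dat$sum <- dat$hyp + dat$chl
dat <- dat[, c("sum", "age", "bmi", "hyp", "chl")]

# Same setup as above: passive method for sum, with hyp and chl as inputs
ini <- mice(dat, maxit = 0, printFlag = FALSE)
meth <- ini$method
pred <- ini$predictorMatrix
meth["sum"] <- "~I(hyp+chl)"
pred["sum", c("hyp", "chl")] <- 1

imp_first <- mice(dat, maxit = 1, method = meth, predictorMatrix = pred,
                  seed = 2, printFlag = FALSE)

# Flag every row of the first completed dataset where sum != hyp + chl
done <- mice::complete(imp_first, 1)
mismatch <- done$sum != done$hyp + done$chl
table(mismatch)
```

Any TRUE entries in mismatch mark rows where the imputed sum no longer matches its components.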

Why does this happen?


Solution

  • As mentioned in the mice() documentation,

    Though not strictly needed, it is often useful to specify visitSequence such that the column that is imputed by the ~ mechanism is visited each time after one of its predictors was visited. In that way, deterministic relation between columns will always be synchronized.

    The problem you've found is that sum is computed by passive imputation first, and then hyp or chl (or both) is imputed with a new value later in the same iteration. That replaces the value sum was calculated from, so sum is stale until it is recalculated. In your first example sum was the last column, so it was visited after hyp and chl and the relation held. mice() will pick a suitable order in some situations, and a monotone visit sequence sometimes works, but in this case we need to help it out.

    Checking imp2$visitSequence will tell you the order in which the variables were imputed.

    imp2$visitSequence
    # [1] "sum" "age" "bmi" "hyp" "chl"
    

    Setting a new visitSequence in which sum is visited again after hyp and chl solves the issue. Below, "sum" is appended to the end of the existing sequence. Note that we also set maxit = 2 to avoid a warning message.

    visit <- imp2$visitSequence
    visit2 <- c(visit, "sum")
    imp3 <- mice(nhanes_sum2, maxit = 2, meth = meth2, pred = pred2, 
                 visitSequence = visit2, seed = 2)
    head(complete(imp3))
    #   sum age  bmi hyp chl
    # 1 200   1 29.6   1 199
    # 2 188   2 22.7   1 187
    # 3 188   1 28.7   1 187
    # 4 205   3 20.4   1 204
    # 5 114   1 20.4   1 113
    # 6 186   3 20.4   2 184
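
    An alternative is to move sum out of its original position so that it is visited only once, after everything else, and then to confirm the relation in every imputed dataset rather than just the head of the first. A minimal sketch, assuming the same setup as in the question (rebuilt in base R so that it runs standalone; dat, ini, visit_last, imp_last and consistent are illustrative names):

    ```r
    library(mice)

    # Rebuild the reordered data: sum = hyp + chl, placed first
    dat <- nhanes
    dat$sum <- dat$hyp + dat$chl
    dat <- dat[, c("sum", "age", "bmi", "hyp", "chl")]

    # Passive method for sum, with hyp and chl as its inputs
    ini <- mice(dat, maxit = 0, printFlag = FALSE)
    meth <- ini$method
    pred <- ini$predictorMatrix
    meth["sum"] <- "~I(hyp+chl)"
    pred["sum", c("hyp", "chl")] <- 1

    # Visit every other column in its existing order, then sum last
    visit_last <- c(setdiff(names(dat), "sum"), "sum")

    imp_last <- mice(dat, maxit = 2, method = meth, predictorMatrix = pred,
                     visitSequence = visit_last, seed = 2, printFlag = FALSE)

    # One logical per imputed dataset: does sum == hyp + chl in every row?
    consistent <- sapply(seq_len(imp_last$m), function(i) {
      d <- mice::complete(imp_last, i)
      all(d$sum == d$hyp + d$chl)
    })
    consistent
    ```

    Because sum is recomputed after hyp and chl at the end of each iteration, every element of consistent should be TRUE.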