
Passive imputation with mice gives wrong sumscore


I am working on a large dataset of 76 persons with 374 variables. My primary outcome variable is the sum score on a depression severity questionnaire (PHQ-9). Approximately 4% of the data are missing, so I want to use imputation. I have been working with the mice package, following the instructions in Buuren, S. van, & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3). https://doi.org/10.18637/jss.v045.i03. I tried to replicate their instructions for using passive imputation to generate sum scores, but I get the wrong result. I can't figure out why; I think I have followed the instructions correctly.

I cannot post the data because it is sensitive, but the following code, which essentially mirrors my original code, reproduces the error:

library("mice")
library("lattice")
set.seed(1234)
m<-matrix(sample(c(NA, 1:10), 100, replace = T), 10)
df<-as.data.frame(m)

ini<-mice(cbind(df, sumScore=NA), max = 0, print=F)
meth<-ini$method
meth[1:4]<-""
meth[5:10]<-"pmm"
meth["sumScore"]<-"~I(rowSums(df[,5:10]))"
pred<-ini$predictorMatrix
pred[, 1:4]<-0
pred[5:10, "sumScore"]<-0
pred[1:4, "sumScore"]<-1

imp<-mice(cbind(df, sumScore=NA), predictorMatrix = pred, method =  meth)
com<-complete(imp, "long")

I get the following output:

    .imp .id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10  sumScore
 1    1   1  1  7  3  5  6  1  9  1 10   1   0.9224428
 2    1   2  6  5  3  2  7  3  3  9  5   9   0.6210974
 3    1   3  6  3  1  3  3  7  3  5  1   1   0.3563335
 4    1   4  6 10 NA  5  6  5  5  8  5   1   0.0711464
 5    1   5  9  3  2  1  3  1  2  3  2   1   0.7318026
 6    1   6  7  9  8  8  5  5  7  5  9   5   0.6197897

Solution

  • Your predictor matrix is set up incorrectly. (I also doubt that rowSums on df can be used in the method: I don't think so, since df refers to the original data and not the imputed versions.)
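
    The doubt about df comes down to how R resolves names inside a formula. The sketch below uses base R's model.frame on two made-up data frames (original and working are illustrative names, not part of the question). It assumes, without confirmation from the source, that mice evaluates a passive formula against its working copy of the data in a similar way.

    ```r
    # Made-up data: `original` still has a missing value, `working` has it filled in.
    original <- data.frame(a = c(1, NA), b = c(2, 3))
    working  <- data.frame(a = c(1, 5),  b = c(2, 3))

    # Bare column names are looked up in `data` first, so the filled-in values are used.
    by_name <- model.frame(~ I(a + b), data = working, na.action = na.pass)[[1]]

    # `original` is not a column of `working`, so R falls back to the object in the
    # enclosing environment and the missing value comes back.
    by_object <- model.frame(~ I(rowSums(original)), data = working,
                             na.action = na.pass)[[1]]

    print(as.numeric(by_name))
    print(as.numeric(by_object))
    ```

    Under that assumption, a formula written with bare column names such as V5 + V6 sees the imputed values, while one that names the data frame itself does not.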

    The predictor matrix should be read row by row: each row is a variable to be imputed, and a 1 in a column means that column's variable is used to predict it. Your matrix looks like this:

    > pred
             V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sumScore
    V1        0  0  0  0  1  1  1  1  1   0        1
    V2        0  0  0  0  1  1  1  1  1   1        1
    V3        0  0  0  0  1  1  1  1  1   1        1
    V4        0  0  0  0  1  1  1  1  1   1        1
    V5        0  0  0  0  0  1  1  1  1   1        0
    V6        0  0  0  0  1  0  1  1  1   1        0
    V7        0  0  0  0  1  1  0  1  1   1        0
    V8        0  0  0  0  1  1  1  0  1   1        0
    V9        0  0  0  0  1  1  1  1  0   1        0
    V10       0  0  0  0  1  1  1  1  1   0        0
    sumScore  0  0  0  0  0  0  0  0  0   0        0
    

    A row that contains only zeros uses none of the variables for imputation. Here the sumScore row is all zeros, so no variable is really used to predict sumScore and you end up with random noise.
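
    To make the row-wise reading concrete, here is a minimal sketch on a made-up 3 x 3 matrix in the same layout (the variable names and values are illustrative only). It lists the rows that use no predictors and, for each row, the columns that are switched on.

    ```r
    # Toy predictor matrix: rows = variables being imputed, columns = candidate predictors.
    vars <- c("V5", "V6", "sumScore")
    pred <- matrix(0, nrow = 3, ncol = 3, dimnames = list(vars, vars))
    pred["V5", "V6"] <- 1   # V6 is used to predict V5
    pred["V6", "V5"] <- 1   # V5 is used to predict V6

    # Rows that contain only zeros, i.e. variables with no predictors at all
    no_predictors <- rownames(pred)[rowSums(pred) == 0]
    print(no_predictors)

    # For each row, the names of the columns set to 1
    used_by <- sapply(vars, function(v) colnames(pred)[pred[v, ] == 1],
                      simplify = FALSE)
    print(used_by)
    ```

    Running the same two checks on your own pred is a quick way to spot a row like sumScore that has been left without predictors.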

    Try this code instead

    library("mice")
    library("lattice")
    set.seed(1234)
    m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
    df <- cbind(as.data.frame(m), sumScore = NA)

    ini <- mice(df, maxit = 0, printFlag = FALSE)  # Dry run to get the default settings
    meth <- ini$method
    meth[1:4] <- ""       # Never impute these variables
    meth[5:10] <- "pmm"   # Use pmm to impute these
    meth["sumScore"] <- "~I(V5+V6+V7+V8+V9+V10)"  # Passive imputation of the sum

    pred <- ini$predictorMatrix
    pred[, 1:4] <- 0    # Never use V1-V4 as predictors (as in your code)
    pred[1:4, "sumScore"] <- 1  # Use the sum to impute V1-V4 (but they have no method, so no point!)
    pred[paste0("V", 5:10), "sumScore"] <- 0  # Make sure that we don't impute the "wrong way"
    pred["sumScore", paste0("V", 5:10)] <- 1  # Make sure that V5 to V10 are available for sumScore

    imp <- mice(df, predictorMatrix = pred, method = meth)  # Same call as in your code
    com <- complete(imp, "long", include = TRUE)
    

    This should give you what you want
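
    One way to confirm that a completed dataset is consistent is to compare the sumScore column with the row-wise sum of its items. The helper below is a hypothetical sketch in base R (sum_score_ok is not part of mice), shown on a made-up two-row data frame.

    ```r
    # Hypothetical helper: TRUE if `total` equals the row-wise sum of `items`
    # in every row of `data`.
    sum_score_ok <- function(data, items, total = "sumScore") {
      expected <- rowSums(data[, items, drop = FALSE])
      isTRUE(all.equal(unname(expected), as.numeric(data[[total]])))
    }

    # Toy stand-in for one completed dataset (values are made up)
    toy <- data.frame(V5 = c(6, 7), V6 = c(1, 3), sumScore = c(7, 10))
    consistent <- sum_score_ok(toy, c("V5", "V6"))   # TRUE

    # Overwrite one sum with a noise-like value to mimic the problem in the question
    toy$sumScore[2] <- 0.62
    broken <- sum_score_ok(toy, c("V5", "V6"))       # FALSE

    print(c(consistent = consistent, broken = broken))
    ```

    With the objects from the code above, a call along the lines of sum_score_ok(complete(imp, 1), paste0("V", 5:10)) would check the first imputed dataset.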