I am working on a large dataset of 76 persons with 374 variables. My primary outcome variable is a depression sum score on a depression severity questionnaire (PHQ-9). There is approximately 4% missing data, so I want to use imputation. I have been working with the mice package following the instructions in Buuren, S. van, & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3). https://doi.org/10.18637/jss.v045.i03. I have tried to replicate their instructions on how to use passive imputation for generating sum scores. However, I get the wrong result. I can't figure out why; I think I have followed the instructions correctly.
I cannot post the data, since it is sensitive, but I am able to reproduce the error using this code, which essentially mirrors my original code:
library("mice")
library("lattice")
set.seed(1234)
m<-matrix(sample(c(NA, 1:10), 100, replace = T), 10)
df<-as.data.frame(m)
ini<-mice(cbind(df, sumScore=NA), max = 0, print=F)
meth<-ini$method
meth[1:4]<-""
meth[5:10]<-"pmm"
meth["sumScore"]<-"~I(rowSums(df[,5:10]))"
pred<-ini$predictorMatrix
pred[, 1:4]<-0
pred[5:10, "sumScore"]<-0
pred[1:4, "sumScore"]<-1
imp<-mice(cbind(df, sumScore=NA), predictorMatrix = pred, method = meth)
com<-complete(imp, "long", indlude=T)
I get the following output:
  .imp .id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10  sumScore
1    1   1  1  7  3  5  6  1  9  1 10   1 0.9224428
2    1   2  6  5  3  2  7  3  3  9  5   9 0.6210974
3    1   3  6  3  1  3  3  7  3  5  1   1 0.3563335
4    1   4  6 10 NA  5  6  5  5  8  5   1 0.0711464
5    1   5  9  3  2  1  3  1  2  3  2   1 0.7318026
6    1   6  7  9  8  8  5  5  7  5  9   5 0.6197897
Your predictor matrix is set up incorrectly (and I'm not sure rowSums on df can be used either; I don't think so, since df refers to the original data and not the imputed versions).
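The concern about df can be illustrated in base R without mice. This is only a sketch on toy data: it assumes that the passive formula is evaluated against a working copy of the data in roughly the way model.frame does here, which is an assumption about mice internals and not something confirmed in this thread. The names orig and filled are hypothetical.

```r
# Toy data (not the poster's): `orig` has a missing value, `filled` stands in
# for a copy of the data in which that value has already been imputed.
orig   <- data.frame(a = c(1, NA), b = c(2, 3))
filled <- data.frame(a = c(1, 10), b = c(2, 3))

# `orig` is not a column of `filled`, so R falls back to the outer object
# and the row sums are computed from the un-imputed data.
by_object <- model.frame(~ I(rowSums(orig)), data = filled,
                         na.action = na.pass)[[1]]

# Bare column names are looked up in `data` first, so the filled-in
# values are the ones that get summed.
by_column <- model.frame(~ I(a + b), data = filled,
                         na.action = na.pass)[[1]]

print(as.numeric(by_object))
print(as.numeric(by_column))
```

If that assumption holds, a formula written in terms of column names (such as V5 + V6) tracks the imputed values, while one that indexes an outside data frame does not.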
The predictor matrix should be read row by row: each row is a variable to be imputed, and the columns marked 1 are the variables used to predict it. Your matrix looks like this:
> pred
         V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sumScore
V1        0  0  0  0  1  1  1  1  1   0        1
V2        0  0  0  0  1  1  1  1  1   1        1
V3        0  0  0  0  1  1  1  1  1   1        1
V4        0  0  0  0  1  1  1  1  1   1        1
V5        0  0  0  0  0  1  1  1  1   1        0
V6        0  0  0  0  1  0  1  1  1   1        0
V7        0  0  0  0  1  1  0  1  1   1        0
V8        0  0  0  0  1  1  1  0  1   1        0
V9        0  0  0  0  1  1  1  1  0   1        0
V10       0  0  0  0  1  1  1  1  1   0        0
sumScore  0  0  0  0  0  0  0  0  0   0        0
When a row contains only zeros, none of the variables are used to impute that variable. Here the sumScore row is all zeros, so nothing is really used to predict sumScore and you end up with random noise.
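The row-wise reading can be checked mechanically. The sketch below uses base R only and rebuilds three rows of the matrix printed above (values copied from it); the names predictors and no_predictors are just illustrative.

```r
# Rebuild three rows of the predictor matrix (values copied from the printout).
vars <- c(paste0("V", 1:10), "sumScore")
pred <- rbind(
  V5       = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0),
  V6       = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0),
  sumScore = rep(0, 11)
)
colnames(pred) <- vars

# For each target variable (row), list the columns flagged with 1.
predictors <- lapply(seq_len(nrow(pred)),
                     function(i) colnames(pred)[pred[i, ] == 1])
names(predictors) <- rownames(pred)

# Targets whose row is all zeros have no predictors at all.
no_predictors <- rownames(pred)[rowSums(pred) == 0]

print(predictors)
print(no_predictors)
```

Running the same two steps on the full ini$predictorMatrix is a quick way to spot a target that was accidentally left without predictors.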
Try this setup instead of your original code:
library("mice")
library("lattice")
set.seed(1234)
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
df <- cbind(as.data.frame(m), sumScore = NA)
ini <- mice(df, maxit = 0, printFlag = FALSE) # Dry run to get the default settings
meth <- ini$method
meth[1:4] <- "" # Never impute for these variables
meth[5:10] <- "pmm" # Use pmm to impute for these
meth["sumScore"] <- "~I(V5+V6+V7+V8+V9+V10)" # Passive imputation: refer to the columns by name
pred <- ini$predictorMatrix
pred[, 1:4] <- 0 # Never use V1-V4 for imputation (since you had the same)
pred[1:4, "sumScore"] <- 1 # Use the sum to impute for first 4 (but no method so no point!)
pred[paste0("V", 5:10), "sumScore"] <- 0 # Make sure that we don't impute the "wrong way"
pred["sumScore", paste0("V", 5:10)] <- 1 # Make sure that V5 to V10 are available for sumScore
# Then run the imputation as in your own code
imp <- mice(df, predictorMatrix = pred, method = meth)
com <- complete(imp, "long", include = TRUE)
This should give you what you want.
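One way to confirm the result is to compare sumScore with the row sum of its items in every imputed data set. The helper below is a hypothetical base R sketch, not part of mice, and the toy data frame only stands in for the long-format output of complete(); its values are made up for illustration.

```r
# Hypothetical helper: TRUE if `target` equals the row sum of `items`
# in every imputed row of a long-format completed data set.
check_sum_score <- function(com, items, target = "sumScore") {
  # .imp may be stored as integer or factor depending on the mice version
  imp_no  <- as.integer(as.character(com$.imp))
  imputed <- com[imp_no > 0, , drop = FALSE] # .imp == 0 is the original data
  expected <- unname(rowSums(imputed[, items, drop = FALSE]))
  isTRUE(all.equal(as.numeric(imputed[[target]]), expected))
}

# Toy stand-in for complete(imp, "long", include = TRUE); made-up values.
toy <- data.frame(
  .imp     = c(0, 0, 1, 1),
  .id      = c(1, 2, 1, 2),
  V5       = c(NA, 3, 4, 3),
  V6       = c(2, 5, 2, 5),
  sumScore = c(NA, NA, 6, 8)
)

ok <- check_sum_score(toy, c("V5", "V6"))
print(ok)
```

On real output the call would be check_sum_score(com, paste0("V", 5:10)); a FALSE result means the passive step did not produce the sum of the imputed items.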