I encounter a problem with the use of the mice function to do multiple imputation. I want to do imputation only on part of the missing data, what looking at the help seems possible and straightworward. But i can't get it to work. here is the example:
I have some missing data on x and y:
library(mice)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10),c("x","y")] <- NA
I want only to impute missing data on y:
where <- data.frame(ID = rep(FALSE,100),x = rep(FALSE,100),y = is.na(plouf$y))
I do the imputation
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
I look at the imputed values:
test <- complete(plouf.imp)
Here i still have NAs on y:
> sum(is.na(test$y))
[1] 10
if I use where to say to impute on all values, it works:
where <- data.frame(ID = rep(FALSE,100),x = is.na(plouf$x),y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
test <- complete(plouf.imp)
> sum(is.na(test$y))
[1] 0
but it does the imputation on x too, that I don't want in this specific case (speed reason in a statistial simulation study)
Has anyone any idea ?
This is happening because of below code -
plouf[sample(100,10),c("x","y")] <- NA
Let's consider your 1st case wherein you want to impute y
only. Check it's PredictorMatrix
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
plouf.imp
#PredictorMatrix:
# ID x y
#ID 0 0 0
#x 0 0 0
#y 1 1 0
It says that y
's missing value will be predicted based on ID
& x
since it's value is 1 in row y.
Now check your sample data where you are populating NA in x
& y
column. You can notice that wherever y
is NA x
is also having the same NA value.
So what happens is that when mice
refers PredictorMatrix
for imputation in y
column it encounters NA in x
and ignore those rows as all independent variables (i.e. ID
& x
) are expected to be non-missing in order to predict the outcome i.e. missing values in y
.
Try this -
library(mice)
#sample data
set.seed(123)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10), "x"] <- NA
set.seed(999)
plouf[sample(100,10), "y"] <- NA
#missing value imputation
whr <- data.frame(ID = rep(FALSE,100), x = rep(FALSE,100), y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
test <- complete(plouf.imp)
sum(is.na(test$y))
#[1] 1
Here only one value of y
is left to be imputed and in this case both x
& y
are having NA value i.e. row number 39 (similar to your 1st case).