Tags: r, imputation, r-mice

Mice: partial imputation using where argument failing


I have a problem using the mice function for multiple imputation. I want to impute only part of the missing data, which, judging by the help page, seems possible and straightforward, but I can't get it to work. Here is an example:

I have some missing data on x and y:

library(mice)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10),c("x","y")] <- NA

I want to impute the missing data on y only:

where <- data.frame(ID = rep(FALSE,100),x = rep(FALSE,100),y = is.na(plouf$y))

I run the imputation:

plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)

I look at the imputed values:

test <- complete(plouf.imp)

Here I still have NAs in y:

> sum(is.na(test$y))
[1] 10

If I use where to request imputation of all missing values, it works:

where <- data.frame(ID = rep(FALSE,100),x = is.na(plouf$x),y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
test <- complete(plouf.imp)

> sum(is.na(test$y))
[1] 0

But this imputes x too, which I don't want in this specific case (for speed reasons in a statistical simulation study).

Does anyone have an idea?


Solution

  • This is happening because of the code below:

    plouf[sample(100,10),c("x","y")] <- NA
    

    Consider your first case, in which you want to impute y only. Check its PredictorMatrix:

    plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = where)
    plouf.imp
    #PredictorMatrix:
    #   ID x y
    #ID  0 0 0
    #x   0 0 0
    #y   1 1 0
    

    It says that the missing values of y will be predicted from ID and x, since both have the value 1 in row y.

    Now look at your sample data, where you set NA in the x and y columns together. Wherever y is NA, x is NA as well.

    So when mice consults the PredictorMatrix to impute y, it finds NA in x and skips those rows, because all predictors (here ID and x) must be non-missing in order to predict the missing values of y.
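
    The overlap can be confirmed without running mice at all. The following is a minimal base R sketch that rebuilds the question's data (the seed is an assumption, added only for reproducibility) and counts the rows in which y and its predictor x are missing together:

    ```r
    # Rebuild the question's data: x and y are set to NA in the same 10 rows.
    set.seed(1)  # assumed seed, only so the example is reproducible
    plouf <- data.frame(ID = rep(LETTERS[1:10], each = 10),
                        x = sample(10, 100, replace = TRUE),
                        y = sample(10, 100, replace = TRUE))
    plouf[sample(100, 10), c("x", "y")] <- NA

    # Rows where y is missing and its predictor x is missing too
    both_missing <- is.na(plouf$x) & is.na(plouf$y)
    sum(both_missing)     # 10
    sum(is.na(plouf$y))   # 10, so every missing y also has a missing x

    # Cross-tabulation of the two missingness indicators
    table(x_missing = is.na(plouf$x), y_missing = is.na(plouf$y))
    ```

    Because sample(100, 10) draws 10 distinct row indices and both columns are blanked in one assignment, the two counts always agree for this data. The md.pattern() function in mice gives a similar summary of missingness patterns.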

    Try this -

    library(mice)
    
    #sample data
    set.seed(123)
    plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
    plouf[sample(100,10), "x"] <- NA
    set.seed(999)
    plouf[sample(100,10), "y"] <- NA
    
    #missing value imputation
    whr <- data.frame(ID = rep(FALSE,100), x = rep(FALSE,100), y = is.na(plouf$y))
    plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
    test <- complete(plouf.imp)
    sum(is.na(test$y))
    #[1] 1
    

    Here only one value of y is left unimputed, and in that row (row 39) both x and y are NA, just as in your first case.
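
    To see which rows were left unimputed and whether x is missing there, something like the sketch below may help. It reuses the sample data from this answer, with two assumptions of mine: stringsAsFactors = TRUE, so that ID is a factor rather than a character column in recent R versions, and printFlag = FALSE, to silence the iteration log. The row numbers you get can differ from the ones quoted here, since sample() behaves differently across R versions.

    ```r
    library(mice)

    # Sample data as in the answer. stringsAsFactors = TRUE is an assumption,
    # so that ID is a factor rather than a character column.
    set.seed(123)
    plouf <- data.frame(ID = rep(LETTERS[1:10], each = 10),
                        x = sample(10, 100, replace = TRUE),
                        y = sample(10, 100, replace = TRUE),
                        stringsAsFactors = TRUE)
    plouf[sample(100, 10), "x"] <- NA
    set.seed(999)
    plouf[sample(100, 10), "y"] <- NA

    # Impute y only
    whr <- data.frame(ID = rep(FALSE, 100), x = rep(FALSE, 100), y = is.na(plouf$y))
    plouf.imp <- mice(plouf, m = 1, method = "pmm", maxit = 5, where = whr,
                      printFlag = FALSE)
    test <- complete(plouf.imp)

    # Predictor matrix actually used by mice
    plouf.imp$predictorMatrix

    # Rows of y that are still NA after imputation
    still_missing <- which(is.na(test$y))
    still_missing

    # Original x values in those rows; per the explanation above,
    # these are expected to be NA
    plouf$x[still_missing]
    ```

    If the last line prints only NA values, the remaining gaps in y are exactly the rows whose predictor x was missing.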