The mice R package provides deterministic regression imputation via method = "norm.predict". Because deterministic regression imputation adds no noise to the imputed values, I would expect the imputed values to be identical no matter which seed I use. With univariate missings, this holds. However, I found inconsistencies when imputing multivariate missings. The problem is illustrated below with a reproducible example:
library("mice")
# Example 1: Univariate missings (works fine)
data1 <- data.frame(x1 = c(NA, NA, NA, 8, 5, 1, 7, 4),
                    x2 = c(2, 13, 12, 5, 6, 6, 1, 2),
                    x3 = c(4, 7, 4, 5, 1, 2, 7, 3))
# Impute univariate missings
imp <- mice(data1, method = "norm.predict", m = 1)
complete(imp) # Always the same result
# Example 2: Multivariate missings (leads to inconsistent imputations)
data2 <- data1
data2[4, 2] <- NA
# Impute multivariate missings
imp1 <- mice(data2, method = "norm.predict", m = 1, seed = 111)
imp2 <- mice(data2, method = "norm.predict", m = 1, seed = 222)
# Results are different
complete(imp1)
complete(imp2)
Question: Why are multivariate deterministic regression imputations by mice inconsistent?
From ?mice, have a look at the description of the data.init argument:
data.init A data frame of the same size and type as data, without missing data, used to initialize imputations before the start of the iterative process. The default NULL implies that starting imputation are created by a simple random draw from the data. Note that specification of data.init will start the m Gibbs sampling streams from the same imputations.
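This is where the randomness comes from, not from the norm.predict method itself, which, as you say, is completely deterministic (you can confirm this by typing mice.impute.norm.predict at the console to print its source). With univariate missings, the random starting values are never used as predictors, so the seed has no effect. With multivariate missings, each incomplete variable is predicted from the current imputations of the others, so the random starting values feed into the regressions, and after the default maxit = 5 iterations the result can still depend on them.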
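To make that mechanism concrete, here is a minimal base R sketch of a deterministic chained regression. It is not the mice implementation: chained_predict is a hypothetical helper written for illustration, it uses plain lm() fits, and it fills all missing cells with a single fixed start value rather than a random draw. Under those assumptions, the univariate case should give the same completed data for any start value, whereas the multivariate case can end up somewhere different after a few iterations.

```r
# Toy data mirroring Example 2: x1 is missing in rows 1-3, x2 in row 4.
dat <- data.frame(x1 = c(NA, NA, NA, 8, 5, 1, 7, 4),
                  x2 = c(2, 13, 12, NA, 6, 6, 1, 2),
                  x3 = c(4, 7, 4, 5, 1, 2, 7, 3))

# Hypothetical helper (NOT part of mice): deterministic chained regression.
# `start` fills every missing cell before the first iteration.
chained_predict <- function(dat, start, maxit = 5) {
  miss <- is.na(dat)          # logical matrix marking the missing cells
  cur <- dat
  cur[miss] <- start          # fixed starting imputations
  for (it in seq_len(maxit)) {
    for (v in names(dat)[colSums(miss) > 0]) {
      obs <- !miss[, v]
      # Regress v on all other variables, using rows where v is observed.
      # The other variables may contain current imputations.
      form <- reformulate(setdiff(names(dat), v), response = v)
      fit <- lm(form, data = cur[obs, , drop = FALSE])
      # Replace the missing entries of v by their predicted values.
      cur[!obs, v] <- predict(fit, newdata = cur[!obs, , drop = FALSE])
    }
  }
  cur
}

# Univariate case (only x1 missing): the start value never enters a model fit.
uni <- dat
uni$x2[4] <- 5
all.equal(chained_predict(uni, start = 1), chained_predict(uni, start = 10))

# Multivariate case: compare the completed data for two start values.
res_a <- chained_predict(dat, start = 1)
res_b <- chained_predict(dat, start = 10)
all.equal(res_a, res_b)
```

In the multivariate case, the start value for x2 in row 4 is part of the data used to fit the model for x1, so it propagates into every later prediction. Raising maxit in the sketch is one way to explore how quickly, if at all, the two runs approach each other.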
So to avoid the random sampling, we have to provide mice with data.init:
# Fill every missing cell with a fixed starting value (here 1)
data.init <- data2
data.init[is.na(data.init)] <- 1
imp1 <- mice(data2, method = "norm.predict", m = 1, data.init = data.init, seed = 111)
imp2 <- mice(data2, method = "norm.predict", m = 1, data.init = data.init, seed = 222)
# Results are the same
complete(imp1)
complete(imp2)