I am trying to impute values with a linear model using mice. My understanding is that mice iterates over the columns: for each column with NAs it uses all the other columns as predictors, fits the model, and then samples from this model to fill in the NAs. Here is an example where I generate some data and then introduce missing data using ampute.
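library(mice)
n <- 100
# two columns: y is roughly 2 * x plus a little noise
xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1), y = (1:n) * 2 + rnorm(n, 0, 1))
head(xx)
# introduce missing values
res <- ampute(xx)
head(res$amp)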
The missing data looks like:
x y
1 NA 3.887147
2 2.157168 NA
3 2.965164 6.639856
4 3.848165 8.720441
5 NA 11.167439
6 NA 12.835415
Then I am trying to impute the missing data:
mic <- mice(res$amp,diagnostics = FALSE )
I would expect no NAs to remain after that, but one of the columns always still contains NAs:
colSums(is.na(complete(mic,1)))
Which of the two columns it is appears to be random.
By running the code above I am getting:
> colSums(is.na(complete(mic,1)))
x y
0 30
but also:
> colSums(is.na(complete(mic,1)))
x y
33 0
I tried to run your code and ended up with the same type of problem:
library(mice)
n <- 100
xx<-data.frame(x = 1:n + rnorm(n,0,0.1), y =(1:n)*2 + rnorm(n,0,1))
head(xx)
res <- (ampute(xx))
head(res$amp)
If you look at the summary from the mice call, you get an indication that something is wrong. My data gives:
tempData <- mice(res$amp,m=5,maxit=50,seed=500)
summary(tempData)
Multiply imputed data set
Call:
mice(data = res$amp, m = 5, maxit = 50, seed = 500)
Number of multiple imputations: 5
Missing cells per column:
x y
21 23
Imputation methods:
x y
"pmm" "pmm"
VisitSequence:
x
1
PredictorMatrix:
x y
x 0 0
y 0 0
Random generator seed value: 500
There are two indicators here. One is VisitSequence, which shows that only the first column, x, is visited, and not column y. The other is PredictorMatrix, which contains only zeros off the diagonal, so neither variable uses information from the other as a predictor.
The problem is in your simulated data: the two columns are too collinear (a similar solution is given in this detailed answer). Because the y column is essentially twice the value of the x column, it is silently discarded from the analysis.
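To see the collinearity for yourself, something along these lines should work. It is only a sketch: the seed is arbitrary, and I am assuming your mice version records dropped variables in the loggedEvents component of the returned object (it may be NULL if nothing was logged).

```r
library(mice)

set.seed(1)  # arbitrary seed, only so the run is reproducible
n <- 100
xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1), y = (1:n) * 2 + rnorm(n, 0, 1))
amp <- ampute(xx)$amp

# correlation of the two columns on the rows where both are observed
r_xy <- cor(amp$x, amp$y, use = "pairwise.complete.obs")
print(r_xy)

mic <- mice(amp, printFlag = FALSE)
# assumption: a collinearity problem, if detected, is reported here
print(mic$loggedEvents)
print(mic$predictorMatrix)
```

A correlation this close to 1 is what triggers the behaviour described above.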
Try to simulate data that are not almost perfectly linear and it will work. For example, a quadratic relationship:
n <- 100
xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1), y = (1:n)^2 + rnorm(n, 0, 1))
head(xx)
res <- (ampute(xx))
head(res$amp)
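To check that the imputation now goes through, you can rerun mice on the quadratic data and count the remaining NAs. Again just a sketch with an arbitrary seed and default mice settings; I would expect both counts to be zero, but verify on your own run.

```r
library(mice)

set.seed(2)  # arbitrary seed, only so the run is reproducible
n <- 100
xx <- data.frame(x = 1:n + rnorm(n, 0, 0.1), y = (1:n)^2 + rnorm(n, 0, 1))
amp <- ampute(xx)$amp

mic <- mice(amp, printFlag = FALSE)

# number of NAs left per column in the first completed data set
na_left <- colSums(is.na(complete(mic, 1)))
print(na_left)
print(mic$predictorMatrix)
```

If the predictor matrix now has ones off the diagonal, each column is being used to impute the other.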