I have a matrix that is half-sparse. Half of all cells are blank (na) so when I try to run the 'mice' it tries to work on all of them. I'm only interested in a subset.
Question: In the following code, how do I make "mice" only operate on the first two columns? Is there a clean way to do this using row-lag or row-lead, so that the content of the previous row can help patch holes in the current row?
set.seed(1)
#domain
x <- seq(from=0,to=10,length.out=1000)
#ranges
y <- sin(x) +sin(x/2) + rnorm(n = length(x))
y2 <- sin(x) +sin(x/2) + rnorm(n = length(x))
#kill 50% of cells
idx_na1 <- sample(x=1:length(x),size = length(x)/2)
y[idx_na1] <- NA
#kill more cells
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
y2[idx_na2] <- NA
#assemble base data
my_data <- data.frame(x,y,y2)
#make the rest of the data
for (i in 3:50){
my_data[,i] <- rnorm(n = length(x))
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
my_data[idx_na2,i] <- NA
}
#imputation
est <- mice(my_data)
data2 <- complete(est)
str(data2[,1:3])
Places that I have looked for answers:
I think what you are looking for can be done by modifying the parameter "where" of the mice function. The parameter "where" is equal to a matrix (or dataframe) with the same size as the dataset on which you are carrying out the imputation. By default, the "where" parameter is equal to is.na(data): a matrix with cells equal to "TRUE" when the value is missing in your dataset and equal to "FALSE" otherwise. This means that by default, every missing value in your dataset will be imputed. Now if you want to change this and only impute the values in a specific column (in my example column 2) of your dataset you can do this:
# Define arbitrary matrix with TRUE values when data is missing and FALSE otherwise
A <- is.na(data)
# Replace all the other columns which are not the one you want to impute (let say column 2)
A[,-2] <- FALSE
# Run the mice function
imputed_data <- mice(data, where = A)