I have a dataset with 200 variables that all have some missing values. Each of the 200 variables has another column that I want to use to impute the missing values.
Example data:
have <- data.frame(ID = c(1:10), var1 = c(runif(7), NA, NA, NA), var1_fill = runif(10))
ID var1 var1_fill
1 1 0.68783885 0.140508053
2 2 0.74672512 0.001270443
3 3 0.09607276 0.917535359
4 4 0.03222775 0.363960434
5 5 0.03560543 0.901288399
6 6 0.46595122 0.725499220
7 7 0.42781890 0.781295939
8 8 NA 0.737999219
9 9 NA 0.456795266
10 10 NA 0.314562042
If I wanted to impute for one column I would use this code:
have$var1_imputed <- ifelse(is.na(have$var1) == T, have$var1_fill, have$var1)
ID var1 var1_fill var1_imputed
1 1 0.68783885 0.140508053 0.68783885
2 2 0.74672512 0.001270443 0.74672512
3 3 0.09607276 0.917535359 0.09607276
4 4 0.03222775 0.363960434 0.03222775
5 5 0.03560543 0.901288399 0.03560543
6 6 0.46595122 0.725499220 0.46595122
7 7 0.42781890 0.781295939 0.42781890
8 8 NA 0.737999219 0.73799922
9 9 NA 0.456795266 0.45679527
10 10 NA 0.314562042 0.31456204
I am having trouble figuring out how to write a loop to do this for 200 variables, because I can't use the $ to reference the column names. In the actual dataset, the variable names don't follow any pattern like var1, var2, etc. However, the original 200 variables are located in columns 7 to 206, and the columns to use for the respective imputation are 207 to 406. The columns for imputation also have the same name as the original columns but with an additional suffix, like in the example (var1 and var1_fill).
A data.table
option using fcoalesce
setDT(df)[
,
setNames(
Map(fcoalesce, .SD[, 7:206], .SD[, 207:406]),
paste0(names(.SD[, 7:206]), "_imputed")
)
]