I want to impute missing values for a few sets of columns. For numeric variables I want to replace NA with the median, and for categorical variables I want to replace NA with the mode. I searched for a way to impute different sets of columns separately but did not find one.
My data is large, with many columns, so I keep it in a data.table. Since I am not sure how to do this in data.table, I tried the base R code below, but I seem to be getting the column-name check wrong.
I store the names of the numeric variables in the vector var_num and the names of the categorical variables in the vector var_chr.
Please see my sample code below:
library(data.table)
set.seed(1200)
id <- 1:100
bills <- sample(c(1:20,NA),100,replace = T)
nos <- sample(c(1:80,NA),100,replace = T)
stru <- sample(c("A","B","C","D",NA),100,replace = T)
type <- sample(c(1:7,NA),100,replace = T)
value <- sample(c(100:1000,NA),100,replace = T)
df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
class(df1)
var_num <- c("bills","nos","value")
var_chr <- c("stru","type")
impute <- function(x){
  #print(x)
  if(colnames(x) %in% var_num){
    x[is.na(x)] = median(x,na.rm = T)
  } else if (colnames(x) %in% var_chr){
    x[is.na(x)] = mode(x)
  } else {
    x #if not part of var_num and var_chr then nothing needs to be done and return the original value
  }
  return(x)
}
df1_imp_med <- data.frame(apply(df1,2,impute))
When I run the code above, I get the following error:
Error in if (colnames(x) %in% var_num) { : argument is of length zero
Please help me understand how I can correct this and achieve my requirement.
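On the error itself, a minimal sketch on toy data (not the data from the question) may help. It assumes the cause is that apply() hands each column to the function as a plain vector, so colnames(x) is NULL and the if() condition has length zero. It also shows that base R's mode() returns the storage mode rather than the most frequent value. The object names d, col_names_seen and cond are made up for illustration.

```r
# Toy data, not the data from the question.
d <- data.frame(a = c(1, NA, 3), b = c("x", NA, "x"))

# apply() coerces the data frame to a matrix and passes each column
# to the function as a plain vector, so colnames(x) is NULL inside it.
col_names_seen <- apply(d, 2, function(x) is.null(colnames(x)))
print(col_names_seen)

# NULL %in% ... yields a zero-length logical, which if() cannot evaluate.
cond <- NULL %in% c("a", "b")
print(length(cond))

# mode() reports the storage mode, not the statistical mode.
print(mode(c("x", "y", "x")))
```

Looping over column names rather than over column contents sidesteps the first problem, because the name is then available directly.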
As suggested in the comments, you can use a for loop with set() in data.table for faster imputation:
for(k in names(df1)){
  if(k %in% var_num){
    # impute numeric variables with the median of the non-missing values
    med <- median(df1[[k]], na.rm = TRUE)
    set(x = df1, which(is.na(df1[[k]])), k, med)
  } else if(k %in% var_chr){
    # impute categorical variables with the mode (most frequent value);
    # names() returns a character string, even for a numeric column such as type
    mode_val <- names(which.max(table(df1[[k]])))
    set(x = df1, which(is.na(df1[[k]])), k, mode_val)
  }
}
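One possible variation, sketched on a small toy table rather than the data from the question: computing the mode with a helper that keeps the column's own type, so an integer column such as type is filled with an integer rather than a character string. The helper name stat_mode and the table dt are hypothetical and used only for illustration. bills is stored as double here so that a fractional median would fit the column.

```r
library(data.table)

# Hypothetical helper: most frequent non-NA value, returned in the input's type.
stat_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Small toy table, not the data from the question.
dt <- data.table(
  id    = 1:6,
  bills = c(2, NA, 4, 8, NA, 6),
  stru  = c("A", "B", NA, "B", "B", NA),
  type  = c(1L, 1L, NA, 3L, 1L, 2L)
)
var_num <- c("bills")
var_chr <- c("stru", "type")

# Numeric columns: fill NA rows with the median of the observed values.
for (k in var_num) {
  na_rows <- which(is.na(dt[[k]]))
  set(dt, i = na_rows, j = k, value = median(dt[[k]], na.rm = TRUE))
}

# Categorical columns: fill NA rows with the type-preserving mode.
for (k in var_chr) {
  na_rows <- which(is.na(dt[[k]]))
  set(dt, i = na_rows, j = k, value = stat_mode(dt[[k]]))
}

print(dt)
```

When several values tie for most frequent, this helper returns whichever of them appears first in the column.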