
R data.table: impute missing values for multiple sets of columns


I want to impute missing values for several sets of columns. For numeric variables I want to replace NA with the median, and for categorical variables I want to replace NA with the mode. I searched for a way to impute different sets of columns separately but did not find one.

My data is large, with many columns, so I keep it in a data.table. Since I am not sure how to do this in data.table, I tried the base R code below, but I seem to be getting the column name identification wrong.

I store the names of the numeric variables in the vector var_num and the names of the categorical variables in the vector var_chr.

Please see my sample code below:

library(data.table)
set.seed(1200)
id <- 1:100
bills <- sample(c(1:20,NA),100,replace = T)
nos <- sample(c(1:80,NA),100,replace = T)
stru <- sample(c("A","B","C","D",NA),100,replace = T)
type <- sample(c(1:7,NA),100,replace = T)
value <- sample(c(100:1000,NA),100,replace = T)

df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
class(df1)

var_num <- c("bills","nos","value")
var_chr <- c("stru","type")

impute <- function(x){
  #print(x)
  if(colnames(x) %in% var_num){
    x[is.na(x)] = median(x,na.rm = T)
  } else if (colnames(x) %in% var_chr){
    x[is.na(x)] = mode(x)
  } else {
    x #if not part of var_num and var_chr then nothing needs to be done and return the original value
  }
  return(x)
}


df1_imp_med <- data.frame(apply(df1,2,impute))

When I run the above, I get this error:

Error in if (colnames(x) %in% var_num) { : argument is of length zero

Please help me understand how I can correct this and achieve my requirement.


Solution

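  • The error occurs because apply() converts df1 to a matrix and passes each column to impute() as a plain vector, so colnames(x) is NULL and the if condition has length zero. Note also that mode(x) returns the storage mode of x (for example "character"), not its most frequent value. As suggested in the comments, you can instead combine a for loop with set() in data.table for faster imputation: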

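    for (k in names(df1)) {

      # rows where column k is missing
      na_rows <- which(is.na(df1[[k]]))

      if (k %in% var_num) {

        # impute numeric variables with the median; the column is converted
        # to double first because the median of an integer column can be fractional
        set(df1, j = k, value = as.numeric(df1[[k]]))
        set(df1, i = na_rows, j = k, value = median(df1[[k]], na.rm = TRUE))

      } else if (k %in% var_chr) {

        # impute categorical variables with the mode (most frequent value),
        # taken from the non-missing values so the column type is preserved
        vals <- unique(df1[[k]][!is.na(df1[[k]])])
        mode_val <- vals[which.max(tabulate(match(df1[[k]], vals)))]
        set(df1, i = na_rows, j = k, value = mode_val)
      }
    }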
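
  • The same idea can also be written with := and .SDcols, which some may find easier to read than the loop. The following is only a sketch under assumptions: the helpers impute_median() and impute_mode() are hypothetical names that do not appear in the original post, and the small table dt is made-up data standing in for df1.

```r
library(data.table)

# Hypothetical helper (not from the original post): replace NA with the median.
impute_median <- function(x) {
  # convert to double so a fractional median is not truncated
  x <- as.numeric(x)
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical helper: replace NA with the most frequent non-missing value.
# Ties go to the value that appears first in the column.
impute_mode <- function(x) {
  vals <- unique(x[!is.na(x)])
  x[is.na(x)] <- vals[which.max(tabulate(match(x, vals)))]
  x
}

# Small made-up table standing in for df1
dt <- data.table(
  id    = 1:6,
  bills = c(4L, NA, 10L, 7L, NA, 1L),
  stru  = c("A", "B", NA, "B", "A", "B"),
  type  = c(2L, 2L, NA, 5L, 2L, NA)
)
var_num <- "bills"
var_chr <- c("stru", "type")

# Update each set of columns by reference; id is left untouched
dt[, (var_num) := lapply(.SD, impute_median), .SDcols = var_num]
dt[, (var_chr) := lapply(.SD, impute_mode),   .SDcols = var_chr]
```

    Both versions modify the table by reference. The for-set loop is likely to have less overhead when there are very many columns, while the .SDcols form keeps each imputation rule in its own small function.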