r imputation

R: weighted imputation/imputation preferences


Suppose I have a dataset with multiple columns, one of which is gender. As far as I understand, knnImputation() with its default options computes a distance metric in which all variables are treated equally. I would like to define a rule under which, for example, having the same gender is strongly preferred when searching for neighbours: either gender has a stronger influence on the total distance, or only rows with the same gender are considered. (The latter can be done by splitting and then reassembling both the training and testing sets, but maybe there is a simpler way.)

I see that kNNImpute() has the impute.fn parameter for the imputation function and knnImputation() has meth for the method. How can I create such a rule so that it is flexible and easy to edit (e.g., written as a function or something like that)?


Solution

  • This does not do variable selection, but it imputes with kNN using only the rows that have the matching gender g, as you suggest in the comments:

    Sys.setenv("PKG_CXXFLAGS"="-std=c++0x") # needed for the lambda functions in Rcpp
    # install/load package, create example data
    devtools::install_github("alexwhitworth/imputation")
    library(imputation)
    
    set.seed(1345)
    g <- sample(c("M", "F"), 100, replace = TRUE)
    a <- matrix(rnorm(1000), ncol=10)
    a[a>1.5] <- NA
    df <- data.frame(a,g)
    
    # subset by gender, exclude character column from kNN (which doesn't 
    # handle character variables)
    df_f <- kNN_impute(df[df$g == "F", 1:10], k= 3, q= 2, check_scale = FALSE, parallel= FALSE)
    df_m <- kNN_impute(df[df$g == "M", 1:10], k= 3, q= 2, check_scale = FALSE, parallel= FALSE)
    
    # recombine. Can use rownames as key
    df2 <- data.frame(rbind(df_f$x, df_m$x))
    df2 <- df2[order(as.integer(rownames(df2))),]
    df2$g <- df$g
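
The other option raised in the question, letting gender influence the distance instead of splitting the data, can be sketched in base R. This is only an illustration under stated assumptions: `weighted_knn_impute`, `w` and `group_penalty` are hypothetical names and are not part of the imputation or DMwR packages. The sketch uses a weighted Euclidean distance over the columns two rows both have observed, adds a penalty when the group differs, and fills each gap with the unweighted mean of the k nearest donors.

```r
# Hypothetical helper, not from any package: kNN imputation where
#   w             gives a per-column weight in the distance, and
#   group_penalty is added to the distance when the group (e.g. gender) differs.
# group_penalty = Inf restricts neighbours to the same group (the hard split).
weighted_knn_impute <- function(x, group, k = 3, w = rep(1, ncol(x)),
                                group_penalty = 10) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x), length(group) == nrow(x), length(w) == ncol(x))
  out <- x
  for (i in which(rowSums(is.na(x)) > 0)) {
    # squared differences to row i; NA wherever either value is missing
    diffs <- sweep(x, 2, x[i, ], "-")^2
    diffs <- sweep(diffs, 2, w, "*")
    n_shared <- rowSums(!is.na(diffs))
    # root of the mean weighted squared difference over shared columns
    d <- sqrt(rowSums(diffs, na.rm = TRUE) / n_shared)
    # ifelse() avoids Inf * 0, which would be NaN in R
    d <- d + ifelse(group != group[i], group_penalty, 0)
    d[i] <- Inf               # a row is never its own neighbour
    d[n_shared == 0] <- Inf   # no shared columns: distance undefined
    for (j in which(is.na(x[i, ]))) {
      # donors must have column j observed and a finite distance
      cand <- which(!is.na(x[, j]) & is.finite(d))
      if (length(cand) == 0) next   # leave NA if no donor exists
      nn <- cand[order(d[cand])][seq_len(min(k, length(cand)))]
      out[i, j] <- mean(x[nn, j])
    }
  }
  out
}

# Same example data as above
set.seed(1345)
g <- sample(c("M", "F"), 100, replace = TRUE)
a <- matrix(rnorm(1000), ncol = 10)
a[a > 1.5] <- NA

soft <- weighted_knn_impute(a, g, k = 3, group_penalty = 2)    # gender preferred
hard <- weighted_knn_impute(a, g, k = 3, group_penalty = Inf)  # gender required
```

Because the rule lives in two arguments, it is easy to edit: raise `group_penalty` to make a gender match matter more, or change `w` to up-weight or down-weight individual numeric columns. The penalty is on the same scale as the distance, so a sensible value depends on how the columns are scaled.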