Tags: python, r, missing-data, imputation

Missing values imputation for categorical variables in Python


I'm looking for a good imputation method for this case. I have a dataframe with categorical variables and missing data, like the following one:

    import pandas as pd

    var1 = ['a','a','a','c','e',None]
    var2 = ['p1','p1','p1','p2','p3','p1']
    var3 = ['o1','o1','o1','o2','o3','o1']

    df = pd.DataFrame({'v1': var1, 'v2': var2, 'v3': var3})

I'm looking for an imputation method in Python (could be R as well) that supports categorical variables. The idea is to predict var1 given var2 and var3. For example, suppose we want to predict the None value in var1: we know that the probability of var1='a' given var2='p1' and var3='o1' is 1, that is, P(var1='a' | var2='p1', var3='o1') = 3/3 = 1. I thought about programming something like conditional modes, but maybe someone has already implemented this, or there's a better algorithm for it. I have just 3 categorical variables with multiple categories, whose missing values are MCAR. It is important to mention that my dataset has more than a million rows (and about 10% NAs).
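
To make the idea concrete, here is a rough pandas sketch of the conditional-mode approach I have in mind (just an illustration of the idea, not something I have tested at scale):

    # impute each missing v1 with the most frequent v1 observed
    # for the same (v2, v3) combination
    mode_by_group = (
        df.dropna(subset=['v1'])
          .groupby(['v2', 'v3'])['v1']
          .agg(lambda s: s.mode().iloc[0])
    )

    missing = df['v1'].isna()
    keys = list(zip(df.loc[missing, 'v2'], df.loc[missing, 'v3']))
    df.loc[missing, 'v1'] = mode_by_group.reindex(keys).values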

Do you have anything to recommend?

Thanks in advance, Tomas


Solution

  • You can use k-nearest neighbors (KNN) imputation.

    Here's one example in R:

    library(DMwR)
    
    var1 = c('a','a','a','c','e',NA)
    var2 = c('p1','p1','p1','p2','p3','p1')
    var3 = c('o1','o1','o1','o2','o3','o1')
    
    df = data.frame('v1'=var1,'v2'=var2,'v3'=var3)
    df
    
    knnOutput <- DMwR::knnImputation(df, k = 5) 
    knnOutput
    

    Output:

      v1 v2 v3
    1  a p1 o1
    2  a p1 o1
    3  a p1 o1
    4  c p2 o2
    5  e p3 o3
    6  a p1 o1
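
    Since you asked for Python: here is a rough scikit-learn sketch of the same idea. It is my own translation, not part of DMwR: one-hot encode v2 and v3 with pandas and let KNeighborsClassifier predict the missing v1 values from the complete rows.

    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    var1 = ['a','a','a','c','e',None]
    var2 = ['p1','p1','p1','p2','p3','p1']
    var3 = ['o1','o1','o1','o2','o3','o1']
    df = pd.DataFrame({'v1': var1, 'v2': var2, 'v3': var3})

    # KNN needs numeric features, so one-hot encode the predictors
    X = pd.get_dummies(df[['v2', 'v3']])
    known = df['v1'].notna()

    # fit on the complete rows, then predict v1 where it is missing
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X[known], df.loc[known, 'v1'])
    df.loc[~known, 'v1'] = knn.predict(X[~known])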
    
    

    UPDATE:

    KNN doesn't work well for large data sets. Two options that do scale are multinomial imputation and Naive Bayes imputation. Multinomial imputation is a little easier because you don't need to convert the variables into dummy variables. The Naive Bayes implementation shown below is a little more work because it requires converting to dummy variables. Below, I show how to fit each of these in R (a Python sketch of both follows at the end):

    # make data with 6M rows
    var1 = rep(c('a','a','a','c','e',NA), 10**6)
    var2 = rep(c('p1','p1','p1','p2','p3','p1'), 10**6)
    var3 = rep(c('o1','o1','o1','o2','o3','o1'), 10**6)
    df = data.frame('v1'=var1,'v2'=var2,'v3'=var3)
    
    ####################################################################
    ## Multinomial imputation
    library(nnet)
    # fit multinomial model on only complete rows
    imputerModel = multinom(v1 ~ (v2 + v3)^2, data = df[!is.na(df$v1), ])
    
    # predict missing data
    predictions = predict(imputerModel, newdata = df[is.na(df$v1), ])
    
    ####################################################################
    #### Naive Bayes
    library(naivebayes)
    library(fastDummies)
    # convert to dummy variables
    dummyVars <- fastDummies::dummy_cols(df, 
                                         select_columns = c("v2", "v3"), 
                                         ignore_na = TRUE)
    head(dummyVars)
    

    The dummy_cols function adds dummy variables to the existing data frame, so now we will use only columns 4:9 as our training data.

    #     v1 v2 v3 v2_p1 v2_p2 v2_p3 v3_o1 v3_o2 v3_o3
    # 1    a p1 o1     1     0     0     1     0     0
    # 2    a p1 o1     1     0     0     1     0     0
    # 3    a p1 o1     1     0     0     1     0     0
    # 4    c p2 o2     0     1     0     0     1     0
    # 5    e p3 o3     0     0     1     0     0     1
    # 6 <NA> p1 o1     1     0     0     1     0     0
    
    # create training set
    X_train <- na.omit(dummyVars)[, 4:ncol(dummyVars)]
    y_train <- na.omit(dummyVars)[, "v1"]
    
    X_to_impute <- dummyVars[is.na(df$v1), 4:ncol(dummyVars)]
    
    
    Naive_Bayes_Model = multinomial_naive_bayes(x = as.matrix(X_train),
                                                y = y_train)

    # predict missing data
    Naive_Bayes_preds = predict(Naive_Bayes_Model,
                                newdata = as.matrix(X_to_impute))
    
    
    # fill in predictions
    df$multinom_preds[is.na(df$v1)] = as.character(predictions)
    df$Naive_Bayes_preds[is.na(df$v1)] = as.character(Naive_Bayes_preds)
    head(df, 15)
    
    
    
    #         v1 v2 v3 multinom_preds Naive_Bayes_preds
    #    1     a p1 o1           <NA>              <NA>
    #    2     a p1 o1           <NA>              <NA>
    #    3     a p1 o1           <NA>              <NA>
    #    4     c p2 o2           <NA>              <NA>
    #    5     e p3 o3           <NA>              <NA>
    #    6 <NA> p1 o1              a                 a
    #    7     a p1 o1           <NA>              <NA>
    #    8     a p1 o1           <NA>              <NA>
    #    9     a p1 o1           <NA>              <NA>
    #    10    c p2 o2           <NA>              <NA>
    #    11    e p3 o3           <NA>              <NA>
    #    12 <NA> p1 o1              a                 a
    #    13    a p1 o1           <NA>              <NA>
    #    14    a p1 o1           <NA>              <NA>
    #    15    a p1 o1           <NA>              <NA>
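
    If you want to do the large-data version in Python, both ideas map roughly onto scikit-learn. This is only a sketch under my own assumptions: LogisticRegression as the multinomial model and MultinomialNB on dummy-coded predictors, standing in for nnet::multinom and naivebayes::multinomial_naive_bayes.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    # make data with 6M rows, as in the R example
    var1 = ['a','a','a','c','e',None] * 10**6
    var2 = ['p1','p1','p1','p2','p3','p1'] * 10**6
    var3 = ['o1','o1','o1','o2','o3','o1'] * 10**6
    df = pd.DataFrame({'v1': var1, 'v2': var2, 'v3': var3})

    # dummy-code the predictors and split on missingness of v1
    X = pd.get_dummies(df[['v2', 'v3']])
    known = df['v1'].notna()

    # multinomial (logistic regression) imputation
    multinom = LogisticRegression(max_iter=1000)
    multinom.fit(X[known], df.loc[known, 'v1'])
    df.loc[~known, 'multinom_preds'] = multinom.predict(X[~known])

    # Naive Bayes imputation on the dummy-coded predictors
    nb = MultinomialNB()
    nb.fit(X[known], df.loc[known, 'v1'])
    df.loc[~known, 'nb_preds'] = nb.predict(X[~known])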