Tags: python, r, missing-data, imputation

Missing values imputation for categorical variables in Python


I'm looking for a good imputation method for this case. I have a dataframe with categorical variables and missing data, like the following one:

    import pandas as pd

    var1 = ['a','a','a','c','e',None]
    var2 = ['p1','p1','p1','p2','p3','p1']
    var3 = ['o1','o1','o1','o2','o3','o1']

    df = pd.DataFrame({'v1': var1, 'v2': var2, 'v3': var3})

I'm looking for an imputation method in Python (could be R as well) that supports categorical variables. The idea is to predict var1 given var2 and var3. For example, suppose we want to predict the None value in var1: we know that the probability of var1='a' given var2='p1' and var3='o1' is 1, that is, P(var1='a' | var2='p1', var3='o1') = 3/3 = 1. I thought about programming something like conditional modes, but maybe someone has already implemented this, or there's a better algorithm for it. I have just 3 categorical variables with multiple categories, whose missing values are MCAR. It is important to mention that my dataset has more than a million rows (and about 10% NAs).
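
To make the idea concrete, here is a rough pandas sketch of the conditional-mode approach I have in mind (just an illustration of the idea, not something I have tested at scale):

    # impute each missing v1 with the most frequent v1 observed
    # for the same (v2, v3) combination
    mode_by_group = (
        df.dropna(subset=['v1'])
          .groupby(['v2', 'v3'])['v1']
          .agg(lambda s: s.mode().iloc[0])
    )

    missing = df['v1'].isna()
    keys = list(zip(df.loc[missing, 'v2'], df.loc[missing, 'v3']))
    df.loc[missing, 'v1'] = mode_by_group.reindex(keys).values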

Do you have anything to recommend?

Thanks in advance, Tomas


Solution

  • You can use k-nearest neighbors (KNN) imputation.

    Here's one example in R:

    library(DMwR)
    
    var1 = c('a','a','a','c','e',NA)
    var2 = c('p1','p1','p1','p2','p3','p1')
    var3 = c('o1','o1','o1','o2','o3','o1')
    
    df = data.frame('v1'=var1,'v2'=var2,'v3'=var3)
    df
    
    knnOutput <- DMwR::knnImputation(df, k = 5) 
    knnOutput
    

    Output:

      v1 v2 v3
    1  a p1 o1
    2  a p1 o1
    3  a p1 o1
    4  c p2 o2
    5  e p3 o3
    6  a p1 o1
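
    Since you asked for Python: here is a rough scikit-learn sketch of the same idea. It is my own translation, not part of DMwR: one-hot encode v2 and v3 with pandas and let KNeighborsClassifier predict the missing v1 values from the complete rows.

    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    var1 = ['a','a','a','c','e',None]
    var2 = ['p1','p1','p1','p2','p3','p1']
    var3 = ['o1','o1','o1','o2','o3','o1']
    df = pd.DataFrame({'v1': var1, 'v2': var2, 'v3': var3})

    # KNN needs numeric features, so one-hot encode the predictors
    X = pd.get_dummies(df[['v2', 'v3']])
    known = df['v1'].notna()

    # fit on the complete rows, then predict v1 where it is missing
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X[known], df.loc[known, 'v1'])
    df.loc[~known, 'v1'] = knn.predict(X[~known])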
    
    

    UPDATE:

    KNN doesn't work well for large data sets. Two options that do scale are multinomial imputation and Naive Bayes imputation. Multinomial imputation is a little easier because you don't need to convert the variables into dummy variables. The Naive Bayes implementation shown below is a little more work because it requires converting to dummy variables. Below, I show how to fit each of these in R (a Python sketch of both follows at the end):

    # make data with 6M rows
    var1 = rep(c('a','a','a','c','e',NA), 10**6)
    var2 = rep(c('p1','p1','p1','p2','p3','p1'), 10**6)
    var3 = rep(c('o1','o1','o1','o2','o3','o1'), 10**6)
    df = data.frame('v1'=var1,'v2'=var2,'v3'=var3)
    
    ####################################################################
    ## Multinomial imputation
    library(nnet)
    # fit multinomial model on only complete rows
    imputerModel = multinom(v1 ~ (v2 + v3)^2, data = df[!is.na(df$v1), ])
    
    # predict missing data
    predictions = predict(imputerModel, newdata = df[is.na(df$v1), ])
    
    ####################################################################
    #### Naive Bayes
    library(naivebayes)
    library(fastDummies)
    # convert to dummy variables
    dummyVars <- fastDummies::dummy_cols(df, 
                                         select_columns = c("v2", "v3"), 
                                         ignore_na = TRUE)
    head(dummyVars)
    

    The dummy_cols function adds dummy variables to the existing data frame, so now we will use only columns 4:9 as our training data.

    #     v1 v2 v3 v2_p1 v2_p2 v2_p3 v3_o1 v3_o2 v3_o3
    # 1    a p1 o1     1     0     0     1     0     0
    # 2    a p1 o1     1     0     0     1     0     0
    # 3    a p1 o1     1     0     0     1     0     0
    # 4    c p2 o2     0     1     0     0     1     0
    # 5    e p3 o3     0     0     1     0     0     1
    # 6 <NA> p1 o1     1     0     0     1     0     0
    
    # create training set
    X_train <- na.omit(dummyVars)[, 4:ncol(dummyVars)]
    y_train <- na.omit(dummyVars)[, "v1"]
    
    X_to_impute <- dummyVars[is.na(df$v1), 4:ncol(dummyVars)]
    
    
    Naive_Bayes_Model = multinomial_naive_bayes(x = as.matrix(X_train),
                                                y = y_train)

    # predict missing data
    Naive_Bayes_preds = predict(Naive_Bayes_Model,
                                newdata = as.matrix(X_to_impute))
    
    
    # fill in predictions
    df$multinom_preds[is.na(df$v1)] = as.character(predictions)
    df$Naive_Bayes_preds[is.na(df$v1)] = as.character(Naive_Bayes_preds)
    head(df, 15)
    
    
    
    #         v1 v2 v3 multinom_preds Naive_Bayes_preds
    #    1     a p1 o1           <NA>              <NA>
    #    2     a p1 o1           <NA>              <NA>
    #    3     a p1 o1           <NA>              <NA>
    #    4     c p2 o2           <NA>              <NA>
    #    5     e p3 o3           <NA>              <NA>
    #    6 <NA> p1 o1              a                 a
    #    7     a p1 o1           <NA>              <NA>
    #    8     a p1 o1           <NA>              <NA>
    #    9     a p1 o1           <NA>              <NA>
    #    10    c p2 o2           <NA>              <NA>
    #    11    e p3 o3           <NA>              <NA>
    #    12 <NA> p1 o1              a                 a
    #    13    a p1 o1           <NA>              <NA>
    #    14    a p1 o1           <NA>              <NA>
    #    15    a p1 o1           <NA>              <NA>
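
    If you want to do the large-data version in Python, both ideas map roughly onto scikit-learn. This is only a sketch under my own assumptions: LogisticRegression as the multinomial model and MultinomialNB on dummy-coded predictors, standing in for nnet::multinom and naivebayes::multinomial_naive_bayes.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    # make data with 6M rows, as in the R example
    var1 = ['a','a','a','c','e',None] * 10**6
    var2 = ['p1','p1','p1','p2','p3','p1'] * 10**6
    var3 = ['o1','o1','o1','o2','o3','o1'] * 10**6
    df = pd.DataFrame({'v1': var1, 'v2': var2, 'v3': var3})

    # dummy-code the predictors and split on missingness of v1
    X = pd.get_dummies(df[['v2', 'v3']])
    known = df['v1'].notna()

    # multinomial (logistic regression) imputation
    multinom = LogisticRegression(max_iter=1000)
    multinom.fit(X[known], df.loc[known, 'v1'])
    df.loc[~known, 'multinom_preds'] = multinom.predict(X[~known])

    # Naive Bayes imputation on the dummy-coded predictors
    nb = MultinomialNB()
    nb.fit(X[known], df.loc[known, 'v1'])
    df.loc[~known, 'nb_preds'] = nb.predict(X[~known])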