Tags: python, scikit-learn, imputation

KNN imputer with nominal, ordinal and numerical variables


I have the following data:

# Libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics.pairwise import nan_euclidean_distances

# Data set
toy_example = pd.DataFrame(data = {"Color": ["Blue", "Red", "Green", "Blue", np.nan],
                                   "Size": ["S", "M", "L", np.nan, "S"],
                                   "Weight": [10, np.nan, 15, 12, np.nan],
                                   "Age": [2, 4, np.nan, 3, 1]})
toy_example

I want to impute the variables Color (nominal), Size (ordinal), Weight (numerical) and Age (numerical) using KNNImputer from sklearn.impute, which relies on the nan_euclidean distance metric (available as nan_euclidean_distances in sklearn.metrics.pairwise).
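For reference, nan_euclidean ignores coordinates where either value is NaN and rescales the remaining squared differences by the fraction of coordinates present; a minimal check (my own sketch, with made-up numbers):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 4.0, 6.0]])

# The middle coordinate has a NaN, so it is skipped; the remaining
# squared differences are rescaled by (n_total / n_present):
# sqrt(3/2 * ((1 - 2)**2 + (3 - 6)**2)) = sqrt(15)
d = nan_euclidean_distances(X)[0, 1]
```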

I know that I need to pre-process the data first, so I came up with the following 2 solutions:

a. One-hot encoding for the nominal variable, where the NaN values are encoded as their own category

# Preprocessing the data
color_encoder = OneHotEncoder()
color_encoder.fit(X=toy_example[["Color"]])
## Checking categories and names
### A nan dummy is included by default (np.nan is treated as its own category)
color_encoder.categories_
color_encoder.get_feature_names_out()

# Create a new DataFrame with the one-hot encoded "Color" column
color_encoded = pd.DataFrame(color_encoder.transform(toy_example[["Color"]]).toarray(),
                             columns=color_encoder.get_feature_names_out(["Color"]))
color_encoded

# Create a dictionary to map the ordinal values of the "Size" column to numerical values
size_map = {"S": 1, "M": 2, "L": 3}
size_map
toy_example["Size"] = toy_example["Size"].map(size_map)

# Concatenate encoded variables with numerical variables
preprocessed_data = pd.concat([color_encoded, toy_example[["Size", "Weight", "Age"]]], 
                              axis=1)
preprocessed_data

## Matrix of euclidean distances
matrix_nan_euclidean = nan_euclidean_distances(X=preprocessed_data)
matrix_nan_euclidean

# Perform nearest neighbors imputation
imputer = KNNImputer(n_neighbors=2)
imputed_df = pd.DataFrame(imputer.fit_transform(preprocessed_data), 
                          columns=preprocessed_data.columns)
## Here I have a problem where the NaN value in the variable
## "Color" in relation to the 5th row is not imputed
### I was expecting a 0 in the Color_nan and a positive value
### in any of the columns Color_Blue, Color_Green, Color_Red
imputed_df 

As I mention in the code comments, this solution is not feasible for the nominal variable, because I obtain the following result where the nominal variable is not imputed:

   Color_Blue  Color_Green  Color_Red  Color_nan  Size  Weight  Age
0         1.0          0.0        0.0        0.0   1.0    10.0  2.0
1         0.0          0.0        1.0        0.0   2.0    13.5  4.0
2         0.0          1.0        0.0        0.0   3.0    15.0  2.5
3         1.0          0.0        0.0        0.0   1.5    12.0  3.0
4         0.0          0.0        0.0        1.0   1.0    12.5  1.0

For the ordinal variable the value is at least imputed, although I still need to decide on the appropriate rounding method (classical rounding, ceiling or floor).
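For the rounding step, one option (my own sketch, using the hypothetical fractional values KNN averaging produces) is classical rounding followed by clipping back into the valid ordinal range:

```python
import pandas as pd

# Hypothetical fractional "Size" values produced by KNN averaging
imputed_size = pd.Series([1.0, 2.0, 3.0, 1.5, 1.0], name="Size")

# Classical rounding (note: pandas/NumPy round halves to the nearest
# even integer, so 1.5 -> 2), then clip into the valid range 1..3
size_codes = imputed_size.round().clip(lower=1, upper=3).astype(int)
```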

b. One-hot encoding for the nominal variable, where the NaN values are not encoded as a category and the corresponding dummy variables are set to NaN instead

# Preprocessing the data
color_encoder = OneHotEncoder()
color_encoder.fit(X=toy_example[["Color"]])
## Checking categories and names
### A nan dummy is included by default (np.nan is treated as its own category)
color_encoder.categories_
color_encoder.get_feature_names_out()

# Create a new DataFrame with the one-hot encoded "Color" column
color_encoded = pd.DataFrame(color_encoder.transform(toy_example[["Color"]]).toarray(),
                             columns=color_encoder.get_feature_names_out(["Color"]))
color_encoded
## Don't take into account the nan values as a separate category 
color_encoded = color_encoded.loc[:, "Color_Blue":"Color_Red"]
## Because I don't know the values of the dummy variables in advance,
## I replace them with NaN, which is the logical choice given that
## this observation's value for "Color" is unknown
color_encoded.iloc[4, :] = np.nan
color_encoded

# Create a dictionary to map the ordinal values of the "Size" column to numerical values
size_map = {"S": 1, "M": 2, "L": 3}
size_map
toy_example["Size"] = toy_example["Size"].map(size_map)

# Concatenate encoded variables with numerical variables
preprocessed_data = pd.concat([color_encoded, toy_example[["Size", "Weight", "Age"]]], 
                              axis=1)
preprocessed_data

## Matrix of euclidean distances
matrix_nan_euclidean = nan_euclidean_distances(X=preprocessed_data)
matrix_nan_euclidean

# Perform nearest neighbors imputation
imputer = KNNImputer(n_neighbors=2)
imputed_df = pd.DataFrame(imputer.fit_transform(preprocessed_data), 
                          columns=preprocessed_data.columns)
## Here I have a problem because I need to decide how to round
## the values (classical rounding, ceiling or floor) for the 5th row.
## However, none of these methods is consistent: an observation
## cannot be Blue and Green at the same time, but it must be
## exactly one of Blue, Green or Red
imputed_df

As I mention in the code comments, this solution is not feasible for the nominal variable either, because I obtain the following result where the nominal variable takes two partial values or none at all:

   Color_Blue  Color_Green  Color_Red  Size  Weight  Age
0         1.0          0.0        0.0   1.0    10.0  2.0
1         0.0          0.0        1.0   2.0    13.5  4.0
2         0.0          1.0        0.0   3.0    15.0  3.5
3         1.0          0.0        0.0   1.5    12.0  3.0
4         0.5          0.5        0.0   1.0    12.5  1.0

Given that neither a. nor b. works, how can I impute the nominal variable of toy_example in a consistent way using multivariate imputation?


Solution

  • KNNImputer is not suited for categorical features (both ordinal and nominal), since, as stated in the scikit-learn docs:

    Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set.

    It uses the mean of the neighbors, while for a categorical feature you need the mode, or more generally a single category.
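    A two-line illustration of the difference (my own sketch, with made-up neighbor labels): averaging neighbor class labels yields a value outside the label set, while the mode stays inside it:

```python
import pandas as pd

# Hypothetical class labels of the 3 nearest neighbors (1=Blue, 2=Red)
neighbor_colors = pd.Series([1, 1, 2])

mean_label = neighbor_colors.mean()     # 1.33..., not a valid class
mode_label = neighbor_colors.mode()[0]  # 1, a valid class
```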

    What you can do is: use KNNImputer as a pre-processing step to impute the numerical features first, then train a classifier that handles NaN in its inputs, one for each categorical feature. In your case you have Color and Size, so two classifiers are needed.

    I tried the following:

    import pandas as pd
    import numpy as np
    
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.impute import KNNImputer
    
    # Data set
    toy_example = pd.DataFrame(data = {"Color": ["Blue", "Red", "Green", "Blue", np.nan],
                                       "Size": ["S", "M", "L", np.nan, "S"],
                                       "Weight": [10, np.nan, 15, 12, np.nan],
                                       "Age": [2, 4, np.nan, 3, 1]})
    size_map = {"S": 1, "M": 2, "L": 3}
    color_map = {'Blue': 1, 'Red': 2, 'Green': 3}
    
    
    # 1. Convert categories to integers (i.e. class labels)
    toy_example["Size"] = toy_example["Size"].map(size_map)
    toy_example['Color'] = toy_example["Color"].map(color_map)
    toy_example
       Color  Size  Weight  Age
    0    1.0   1.0    10.0  2.0
    1    2.0   2.0     NaN  4.0
    2    3.0   3.0    15.0  NaN
    3    1.0   NaN    12.0  3.0
    4    NaN   1.0     NaN  1.0
    
    # 2. Run the KNN imputer on all columns, but keep only the
    # numerical features (Weight and Age)
    imputer = KNNImputer(n_neighbors=2, weights='distance')  # or 'uniform'
    y = imputer.fit_transform(toy_example.values)
    
    toy_example['Weight'] = y[:, 2]
    toy_example['Age'] = y[:, 3]
    toy_example
       Color  Size     Weight  Age
    0    1.0   1.0  10.000000  2.0
    1    2.0   2.0  13.500000  4.0
    2    3.0   3.0  15.000000  3.0
    3    1.0   NaN  12.000000  3.0
    4    NaN   1.0  11.306019  1.0
    
    # 3. Train a classifier to impute size (ordinal)
    # training data and targets (size)
    x_size = toy_example.values[:, [0, 2, 3]]
    y_size = toy_example.values[:, 1]
    
    # discard entries with nan
    is_nan = np.isnan(y_size)
    
    x_train = x_size[~is_nan]
    y_train = y_size[~is_nan]
    
    # Train classifier and impute size
    clf = HistGradientBoostingClassifier(l2_regularization=1.0).fit(x_train, y_train)
    
    toy_example.loc[is_nan, 'Size'] = clf.predict(x_size[is_nan])
    toy_example
       Color  Size     Weight  Age
    0    1.0   1.0  10.000000  2.0
    1    2.0   2.0  13.500000  4.0
    2    3.0   3.0  15.000000  3.0
    3    1.0   1.0  12.000000  3.0
    4    NaN   1.0  11.306019  1.0
    
    # 4. Impute color
    # training data and targets (color)
    x_color = toy_example.values[:, 1:]
    y_color = toy_example.values[:, 0]
    
    # discard entries with nan
    is_nan = np.isnan(y_color)
    
    x_train = x_color[~is_nan]
    y_train = y_color[~is_nan]
    
    # Train classifier and impute color
    clf = HistGradientBoostingClassifier(l2_regularization=1.0).fit(x_train, y_train)
    
    toy_example.loc[is_nan, 'Color'] = clf.predict(x_color[is_nan])
    toy_example
       Color  Size     Weight  Age
    0    1.0   1.0  10.000000  2.0
    1    2.0   2.0  13.500000  4.0
    2    3.0   3.0  15.000000  3.0
    3    1.0   1.0  12.000000  3.0
    4    1.0   1.0  11.306019  1.0
    
    # 5. Finally, map back to original categories
    size_reverse_map = {v: k for k, v in size_map.items()}
    color_reverse_map = {v: k for k, v in color_map.items()}
    
    toy_example["Size"] = toy_example["Size"].map(size_reverse_map)
    toy_example['Color'] = toy_example["Color"].map(color_reverse_map)
    toy_example
    

    The final output is the following:

       Color Size     Weight  Age
    0   Blue    S  10.000000  2.0
    1    Red    M  13.500000  4.0
    2  Green    L  15.000000  3.0
    3   Blue    S  12.000000  3.0
    4   Blue    S  11.306019  1.0
    

    With more samples and the right model capacity, you'll get accurate predictions for the missing values.