I am implementing a pre-processing pipeline using sklearn's pipeline transformers. My pipeline includes sklearn's KNNImputer estimator that I want to use to impute categorical features in my dataset. (My question is similar to this thread but it doesn't contain the answer to my question: How to implement KNN to impute categorical features in a sklearn pipeline)
I know that the categorical features have to be encoded before imputation and this is where I am having trouble. With standard label/ordinal/onehot encoders, when trying to encode categorical features with missing values (np.nan) you get the following error:
ValueError: Input contains NaN
I've managed to "by-pass" that by creating a custom encoder where I replace the np.nan with 'Missing':
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OrdinalEncoder

class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = None

    def fit(self, X, y=None):
        self.encoder = OrdinalEncoder()
        self.encoder.fit(X.fillna('Missing'))
        return self  # fit should return the estimator itself

    def transform(self, X, y=None):
        return self.encoder.transform(X.fillna('Missing'))

    def fit_transform(self, X, y=None, **fit_params):
        self.encoder = OrdinalEncoder()
        return self.encoder.fit_transform(X.fillna('Missing'))
preprocessor = ColumnTransformer([
    ('categoricals', CustomEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features),
])
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=5)),
])
In this scenario however I cannot find a reasonable way to then set the encoded 'Missing' values back to np.nan before imputing with the KNNImputer.
I've read that I could do this manually using the OneHotEncoder transformer on this thread: Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn, but again, I'd like to implement all of this in a pipeline to automate the entire pre-processing phase.
Has anyone managed to do this? Would anyone recommend an alternative solution? Is imputing with a KNN algorithm maybe not worth the trouble and should I use a simple imputer instead?
Thanks in advance for your feedback!
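I am afraid that this cannot work. If you one-hot encode your categorical data, your missing values will be encoded into a new binary variable, and KNNImputer will fail to deal with them because, once the missing values have become a category of their own, there are no NaN entries left for it to impute.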
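To illustrate the point about one-hot encoding, here is a minimal sketch on a toy column (the data is made up for illustration). It assumes a scikit-learn version in which OneHotEncoder accepts np.nan and treats it as its own category; older versions raise the "Input contains NaN" error instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with one missing value (illustrative data, not from the question).
X = pd.DataFrame({'A': ['x', np.nan, 'z']})

encoder = OneHotEncoder()  # default sparse output, densified below
encoded = encoder.fit_transform(X).toarray()

print(encoder.categories_)      # the learned categories include nan
print(encoded)                  # one indicator column per category
print(np.isnan(encoded).any())  # checks whether any NaN is left to impute
```

After encoding, the missing entry is just another indicator column, so an imputer placed downstream has nothing to fill in.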
Anyway, you have a couple of options for imputing missing categorical variables using scikit-learn:
- sklearn.impute.SimpleImputer using strategy="most_frequent": this will replace missing values with the most frequent value along each column, whether they are strings or numeric data.
- sklearn.impute.KNNImputer, with some limitations: you first have to transform your categorical features into numeric ones while preserving the NaN values (see: LabelEncoder that keeps missing values as 'NaN'). Then you can use the KNNImputer with only the nearest neighbour as replacement (if you use more than one neighbour, it will return a meaningless average of category codes). For example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

# Label-encode only the non-null entries of each column, so NaN is preserved
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))

imputer = KNNImputer(n_neighbors=1)
imputer.fit_transform(df)

Input (df before encoding):
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN

Output:
array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 2., 0.]])
- sklearn.impute.IterativeImputer, replicating a MissForest imputer for mixed data (but you will have to process numeric and categorical features separately). For example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer  # needed to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
categorical = ['A']
numerical = ['B', 'C']

# Label-encode only the non-null entries, so NaN is preserved
df[categorical] = df[categorical].apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))

imp_num = IterativeImputer(estimator=RandomForestRegressor(),
                           max_iter=10, random_state=0)
imp_cat = IterativeImputer(estimator=RandomForestClassifier(),
                           max_iter=10, random_state=0)

df[numerical] = imp_num.fit_transform(df[numerical])
df[categorical] = imp_cat.fit_transform(df[categorical])