python pandas machine-learning scikit-learn sklearn-pandas

Create a custom imputer for categorical variables in sklearn

I have a dataset with many missing categorical values, and I would like to write a custom imputer that fills each empty value with "No <variable name>".

For example, if the column "workclass" has a NaN value, replace it with "No workclass".

Currently I do it like this:

# assign the result back; an in-place fillna on a selected column may not update X_train
X_train['workclass'] = X_train['workclass'].fillna("No workclass")

But I would like to turn this into an imputer so that I can use it in a pipeline.

Solution

You could define a custom transformer using TransformerMixin. Here's a simple example of how to define a transformer and include it in a pipeline:

import numpy as np
import pandas as pd

df = pd.DataFrame({'workclass':['class1', np.nan, 'Some other class', 'class1', 
                                np.nan, 'class12', 'class2', 'class121'], 
                   'color':['red', 'blue', np.nan, 'pink',
                            'green', 'magenta', np.nan, 'yellow']})
# train test split of X
df_train = df[:3]
df_test = df[3:]

print(df_test)

  workclass    color
3    class1     pink
4       NaN    green
5   class12  magenta
6    class2      NaN
7  class121   yellow

The idea is to fit on the df_train dataframe and then apply the same transformation to df_test. We can define a custom transformer class that inherits from TransformerMixin:

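from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator adds get_params/set_params, which cloning and grid search rely on
class InputColName(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        # y defaults to None so that fit(X) and fit_transform(X) work without a target
        # the trailing underscore marks an attribute learned during fit
        self.fill_with_ = X.columns
        return self

    def transform(self, X):
        # 'No ' + column names broadcasts across the rows; the result is a NumPy array
        return np.where(X.isna(), 'No ' + self.fill_with_, X)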
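
Note that transform above returns a plain NumPy array, which is why the column names have to be reattached later. If you would rather keep a DataFrame all the way through, a minimal sketch of an alternative is below. The class name ColNameImputer and the attribute fill_values_ are made up for this illustration, and it assumes the input is always a pandas DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColNameImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant: fills NaN in each column with 'No <column name>'."""

    def fit(self, X, y=None):
        # Remember one fill value per column seen during fit.
        self.fill_values_ = {col: f"No {col}" for col in X.columns}
        return self

    def transform(self, X):
        # fillna with a dict fills column by column and returns a new DataFrame,
        # so the index and the column names are preserved.
        return X.fillna(value=self.fill_values_)


demo = pd.DataFrame({"workclass": ["class1", np.nan],
                     "color": [np.nan, "red"]})
filled = ColNameImputer().fit(demo).transform(demo)
print(filled)
```

The trade-off is that this version only accepts DataFrames, whereas the np.where version hands a NumPy array to whatever step comes next.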

Then include it in your pipeline (just using InputColName here to keep the example simple) and fit it with the training data:

pipeline = Pipeline(steps=[
  ('inputter', InputColName())
])
pipeline.fit(df_train)

Now if we try transforming with unseen data:

print(pd.DataFrame(pipeline.transform(df_test), columns=df.columns))

      workclass     color
0        class1      pink
1  No workclass     green
2       class12   magenta
3        class2  No color
4      class121    yellow
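
Since the point of the question was to use the imputer inside a larger pipeline, here is a rough sketch that chains it with scikit-learn's OneHotEncoder. The step names and the choice of encoder are assumptions made for illustration, and the transformer is repeated (with y=None in fit) so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class InputColName(TransformerMixin, BaseEstimator):
    # Same idea as the transformer above, repeated to keep this self-contained.
    def fit(self, X, y=None):
        self.fill_with_ = X.columns
        return self

    def transform(self, X):
        return np.where(X.isna(), "No " + self.fill_with_, X)


df = pd.DataFrame({"workclass": ["class1", np.nan, "Some other class", "class1",
                                 np.nan, "class12", "class2", "class121"],
                   "color": ["red", "blue", np.nan, "pink",
                             "green", "magenta", np.nan, "yellow"]})
df_train, df_test = df[:3], df[3:]

pipeline = Pipeline(steps=[
    ("inputter", InputColName()),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
pipeline.fit(df_train)

# One row per test sample, one column per category learned from df_train
encoded = pipeline.transform(df_test)
print(encoded.shape)
print(encoded.toarray())
```

Here "No workclass" and "No color" are treated by the encoder as ordinary categories, so missing values end up with their own one-hot column instead of raising an error downstream.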