Tags: python, pandas, machine-learning, scikit-learn, sklearn-pandas

Create my custom Imputer for categorical variables sklearn


I have a dataset with many missing categorical values, and I would like to write a custom imputer that fills each missing value with the string "No <column name>".

For example, if the column "workclass" contains a NaN value, it should be replaced with "No workclass".

At the moment I do it like this:

X_train['workclass'].fillna("No workclass", inplace = True)

But I would like to make an imputer, so that I can use it as a step in a pipeline.


Solution

  • You could define a custom transformer using TransformerMixin. Here's a simple example of how to define such a transformer and include it in a pipeline:

    import numpy as np
    import pandas as pd

    # Toy data with missing values in both categorical columns
    df = pd.DataFrame({'workclass':['class1', np.nan, 'Some other class', 'class1',
                                    np.nan, 'class12', 'class2', 'class121'],
                       'color':['red', 'blue', np.nan, 'pink',
                                'green', 'magenta', np.nan, 'yellow']})
    # train test split of X
    df_train = df[:3]
    df_test = df[3:]
    
    print(df_test)
    
      workclass    color
    3    class1     pink
    4       NaN    green
    5   class12  magenta
    6    class2      NaN
    7  class121   yellow
    

    The idea is to fit on the df_train dataframe and then apply the same transformation to df_test. We can define a custom transformer class that inherits from TransformerMixin:

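    from sklearn.pipeline import Pipeline
    from sklearn.base import BaseEstimator, TransformerMixin
    
    class InputColName(BaseEstimator, TransformerMixin):
    
        def fit(self, X, y=None):
            # Remember the training column names; the trailing underscore
            # follows the scikit-learn convention for fitted attributes
            self.fill_with_ = X.columns
            return self
    
        def transform(self, X):
            # 'No ' + column names broadcasts across rows, so each NaN is
            # replaced by the label of its own column (returns a NumPy array)
            return np.where(X.isna(), 'No ' + self.fill_with_, X)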
    

    Then include it in your pipeline (just using InputColName here to keep the example simple) and fit it with the training data:

    pipeline = Pipeline(steps=[
      ('inputter', InputColName())
    ])
    pipeline.fit(df_train)
    

    Now if we try transforming with unseen data:

    print(pd.DataFrame(pipeline.transform(df_test), columns=df.columns))
    
          workclass     color
    0        class1      pink
    1  No workclass     green
    2       class12   magenta
    3        class2  No color
    4      class121    yellow
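
Because np.where returns a NumPy array, the result above has lost its index and column labels, which is why it has to be wrapped in pd.DataFrame again. As a minimal sketch of one way around that, assuming the input is always a DataFrame, the variant below stores a per-column dictionary of fill values and passes it to DataFrame.fillna. The class name ColNameImputer and the attribute fill_values_ are made up for this illustration and are not part of scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColNameImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: fill NaN with 'No <column name>' and keep the DataFrame."""

    def fit(self, X, y=None):
        # Map every training column to its fill value, e.g. {'color': 'No color'}
        self.fill_values_ = {col: f"No {col}" for col in X.columns}
        return self

    def transform(self, X):
        # DataFrame.fillna accepts a dict of per-column fill values and
        # returns a new DataFrame, so index and column labels are preserved
        return X.fillna(self.fill_values_)


df = pd.DataFrame({'workclass': ['class1', np.nan, 'Some other class', 'class1',
                                 np.nan, 'class12', 'class2', 'class121'],
                   'color': ['red', 'blue', np.nan, 'pink',
                             'green', 'magenta', np.nan, 'yellow']})

pipeline = Pipeline(steps=[('imputer', ColNameImputer())])
pipeline.fit(df[:3])                 # fit on the training rows
result = pipeline.transform(df[3:])  # transform the unseen rows
print(result)
```

Here result is still a DataFrame with the original row labels 3 to 7, so it can be passed straight on to code that selects columns by name. The trade-off is that this version only works on DataFrame input, whereas later scikit-learn steps often hand over plain arrays.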