machine-learning tensorflow scikit-learn sklearn-pandas imputation

Sklearn: Categorical Imputer?


Is there a way to impute categorical values using a sklearn.preprocessing object? I would ultimately like to create a preprocessing object that I can apply to new data, so that the new data is transformed the same way as the old data.

I am looking for a way to do it so that I can use it this way.


Solution

  • Adapting this answer, I wrote an imputer for a pandas.Series object:

    import numpy
    import pandas 
    
    from sklearn.base import TransformerMixin
    
    
    class SeriesImputer(TransformerMixin):
        """Impute missing values in a pandas.Series.
    
        If the Series is of dtype object, impute with the most frequent value.
        If the Series is not of dtype object, impute with the mean.
        """
    
        def fit(self, X, y=None):
            # Learn the fill value from the Series passed to fit()
            if X.dtype == numpy.dtype('O'):
                self.fill = X.value_counts().index[0]
            else:
                self.fill = X.mean()
            return self
    
        def transform(self, X, y=None):
            return X.fillna(self.fill)
    

    To use it you would do:

    # Make a series
    s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])
    
    
    a  = SeriesImputer()   # Initialize the imputer
    a.fit(s1)              # Fit the imputer
    s2 = a.transform(s1)   # Get a new series
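
As an alternative that is not part of the answer above, recent scikit-learn versions (0.20 and later) ship `sklearn.impute.SimpleImputer`, whose `most_frequent` strategy also accepts object-dtype columns. The sketch below shows the fit-once, apply-to-new-data pattern the question asks about. The frames `train` and `new` and the column name `letter` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: "train" is the old data, "new" arrives later.
train = pd.DataFrame({"letter": ["k", "i", "t", "t", "e", np.nan]}, dtype=object)
new = pd.DataFrame({"letter": [np.nan, "k", np.nan]}, dtype=object)

# Learn the most frequent value from the training data only.
imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)

# Reuse the fitted imputer so the new data is filled the same way.
# transform() returns a NumPy array with one column per input column.
new_filled = imputer.transform(new)
print(new_filled.ravel().tolist())
```

Because the fill value is stored on the fitted object (in `imputer.statistics_`), the same imputer can be placed in a `Pipeline` or pickled and reused on later data.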