python, scikit-learn, pipeline

Pipeline with SimpleImputer and OneHotEncoder - how to do properly?


I'm facing a challenge creating a pipeline that imputes (SI) a categorical variable (e.g. colour) and then one-hot encodes (OHE) two variables (e.g. colour and dayofweek). colour is used in both steps.

I wanted to put SI and OHE in one ColumnTransformer. I just learnt that SI and OHE run in parallel there, meaning OHE will not encode the imputed colour (i.e. it encodes the original, un-imputed colour).

I then tried:

si = SimpleImputer(strategy='mean', add_indicator=True)
ohe = OneHotEncoder(sparse=False,drop='first')

ctsi  = ColumnTransformer(transformers=[('si',si,['colour'])],remainder='passthrough')
ctohe = ColumnTransformer(transformers=[('ohe',ohe,['colour','dayofweek'])],remainder='passthrough')

pl = Pipeline([('ctsi',ctsi),('ctohe',ctohe)])

outfit = pl.fit_transform(X,y)

I get the error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

I believe it's because the column name colour has been removed by SI. When I change the OHE columns to a list of int:

ctohe = ColumnTransformer(transformers=[('ohe',ohe,[0,1])],remainder='passthrough')

It runs. I'm only testing the processing; obviously, the columns are incorrect.

So my question is: given what I want to accomplish, is this possible, and how can I do it?

Many thanks in advance!


Solution

  • I agree with your reasoning. The problem is that ColumnTransformer forgets the column names after the transformation: by default its output is a NumPy array (or sparse matrix) rather than a DataFrame. Indeed, quoting the linked answer, ColumnTransformer's intended usage is to apply transformations in parallel. In my opinion, the documentation says as much in this sentence:

    This estimator allows different columns or column subsets of the input to be transformed separately. [...] This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

    One solution might be a custom selection of the columns, passing a callable as the columns element of ColumnTransformer's transformers tuple (name, transformer, columns) (notation from the documentation), in particular for the second ColumnTransformer instance in the pipeline. EDIT: I have to partly withdraw this. I'm not sure that passing a callable as columns would work here, because your problem is not column selection per se but column selection via string column names, which requires a DataFrame (and IMO acting on the column selector alone won't solve that).

    Instead, you probably need a transformer, acting on a DataFrame, that restores the column names after the imputation and before the one-hot encoding (bearing in mind that having different ColumnTransformer instances transform the same variables in sequence in a Pipeline is not the ideal setup).

    A couple of months ago, https://github.com/scikit-learn/scikit-learn/pull/21078 was merged; I suspect it is not yet in the latest release, because I couldn't get it to work after upgrading sklearn. Anyway, IMO it may help in similar situations in the future, as it adds get_feature_names_out() to SimpleImputer, and get_feature_names_out() is in turn really useful when dealing with column names.

    In general, I would also suggest the post linked above for further details.

    Finally, here's a naive example I could come up with; it's not scalable (I tried to get something more scalable using the feature_names_in_ attribute of the fitted SimpleImputer instance, without reaching a consistent result), but hopefully it gives some hints.

    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    
    X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                      'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
                      'expert_rating': [5, 3, 4, 5],
                      'user_rating': [4, 5, 4, 3]})
    
    # Step 1: impute 'city'; the other columns pass through after it.
    ct_1 = ColumnTransformer([('si', SimpleImputer(strategy='most_frequent'), ['city'])],
                             remainder='passthrough')
    # Step 3: one-hot encode 'city', selected by name (needs a DataFrame).
    ct_2 = ColumnTransformer([('ohe', OneHotEncoder(), ['city'])], remainder='passthrough', verbose_feature_names_out=True)
    
    class ColumnExtractor(BaseEstimator, TransformerMixin):
        """Step 2: wrap the array returned by ct_1 in a DataFrame with the given column names."""
        def __init__(self, columns):
            self.columns = columns
    
        def transform(self, X, *_):
            return pd.DataFrame(X, columns=self.columns)
    
        def fit(self, *_):
            return self
    
    # The names must follow ct_1's output order: transformed columns first, then the remainder.
    pipe = Pipeline([
        ('ct_1', ct_1),
        ('ce', ColumnExtractor(['city', 'title', 'expert_rating', 'user_rating'])),
        ('ct_2', ct_2)
    ])
    
    pipe.fit_transform(X)
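
A different route, sketched here as an alternative rather than as the method described above, is to nest a small Pipeline inside a single ColumnTransformer, so that imputation and encoding run in sequence for colour only, while dayofweek goes straight to its own encoder. The toy data, the price column and the step names are hypothetical, and strategy='most_frequent' is assumed because colour is categorical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing colour.
X = pd.DataFrame({
    "colour": ["red", "blue", np.nan, "red"],
    "dayofweek": ["mon", "tue", "mon", "wed"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# Sequential steps for 'colour': impute first, then encode the imputed values.
colour_steps = Pipeline([
    ("si", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(drop="first")),
])

# Each column is handled by exactly one transformer, so selecting
# columns by name on the original DataFrame is enough.
ct = ColumnTransformer(
    [
        ("colour", colour_steps, ["colour"]),
        ("dayofweek", OneHotEncoder(drop="first"), ["dayofweek"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

out = ct.fit_transform(X)
print(out.shape)
```

Because no column is transformed by two separate ColumnTransformer instances, the column names never need to be restored between steps.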