pandas, scikit-learn, pipeline

OneHotEncoder not transforming new columns created by previous transformer


I am using a ColumnTransformer to combine two transformers: one that converts a time column into multiple features like day, month, and day of week, followed by an OHE transformer that encodes the categorical columns.

I am using the code below:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

time_col = ['visitStartTime']


class TimeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        for column in X.columns:
            # Interpret the values as Unix timestamps in seconds
            time = pd.to_datetime(X[column], unit = 's', origin = 'unix')
            X['day_of_week'] = time.dt.strftime('%A')
            X['hour'] = time.dt.hour
            X['day'] = time.dt.day
            X['month'] = time.dt.month
            X['year'] = time.dt.year
        return X

#Transformer to handle visitStartTime
time_transformer = Pipeline(steps =[
    ('time', TimeTransformer())
])

#Transformer to encode categorical features
ohe_transformer = Pipeline(steps = [
    ('ohe', OneHotEncoder())
])

from sklearn.compose import make_column_selector as selector
#Combined transformer
preprocessor = ColumnTransformer(transformers = [
    ('date', time_transformer, time_col ),
    ('ohe',ohe_transformer, selector(dtype_include = 'object'))
],remainder = 'passthrough', sparse_threshold = 0)

j = preprocessor.fit_transform(X_train)

When I check the output j, I see that the categorical columns created by time_transformer have not been encoded.


How can I correct this?


Solution

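  • ColumnTransformer applies each of its transformers to the original input columns and concatenates the results side by side; the transformers do not run one after another. The selector(dtype_include = 'object') is therefore evaluated against X_train as it was passed in, so day_of_week, which only exists in the output of time_transformer, never reaches the OneHotEncoder. The default categories='auto' is not involved here: it only means the encoder learns the category values of each column it receives from the training data.

    There are two things you can do:

    1. Create the time features before the ColumnTransformer (for example in an earlier Pipeline step) and convert the columns you want treated as categorical to str or, better, categorical: df[col] = df[col].astype('category'). The selector then needs dtype_include = ['object', 'category'] to pick them up.
    2. Explicitly list the columns to encode in the ColumnTransformer entry for the encoder: ('ohe', ohe_transformer, ['col1', 'col2', ...]). The categories parameter of OneHotEncoder expects the category values for each column, not column names.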
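A minimal sketch of one way to get the derived columns encoded, under a few assumptions. ColumnTransformer runs each of its transformers on the original input columns, so here the time step is moved into an outer Pipeline that runs before the encoder, and the columns to encode are listed by name. The class name TimeFeatures, the sample data, and the columns channel and pageviews are made up for illustration; only visitStartTime and the derived feature names come from the question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class TimeFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical variant of TimeTransformer that works on the full frame."""

    def __init__(self, column="visitStartTime"):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        time = pd.to_datetime(X[self.column], unit="s", origin="unix")
        X["day_of_week"] = time.dt.strftime("%A")
        X["hour"] = time.dt.hour
        X["day"] = time.dt.day
        X["month"] = time.dt.month
        X["year"] = time.dt.year
        # Replace the raw timestamp with the derived features
        return X.drop(columns=[self.column])


# Made-up sample data; 'channel' stands in for an existing categorical column
X_train = pd.DataFrame({
    "visitStartTime": [1500000000, 1500086400, 1500172800],
    "channel": ["organic", "referral", "organic"],
    "pageviews": [3, 1, 7],
})

# Columns to encode are named explicitly, including the derived one
encoder = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["channel", "day_of_week"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,
)

# The time step runs first, so the encoder sees day_of_week
preprocessor = Pipeline(steps=[("time", TimeFeatures()), ("encode", encoder)])

j = preprocessor.fit_transform(X_train)
print(j.shape)
```

Because the encoder step now receives the output of the time step, day_of_week exists by the time the column list is resolved. The numeric features (hour, day, month, year) pass through unchanged; add them to the column list if they should be one-hot encoded as well.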