python, scikit-learn, pipeline, one-hot-encoding

Error using categorical data in Pipeline with OneHotEncoder


I would like to build a pipeline to predict 'Survival' from the three features 'SibSp_category', 'Parch_category', 'Embarked'.

In the preprocessing step, I (1) use OrdinalEncoder to convert the strings to integers, (2) impute missing values with SimpleImputer using the most frequent value, and (3) create dummy variables from the imputed features with OneHotEncoder, which are then used as the features for xgb. However, when running the optimizer, I receive a ValueError, and I suspect it is thrown by the OneHotEncoder.

Sample code:

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import xgboost as xgb
import lightgbm as lgb
from skopt import BayesSearchCV

df = pd.DataFrame({'SibSp_category': ['alone', 'couple', 'group', 'alone', 'couple', 'group',np.nan],
    'Parch_category': ['alone', 'small', 'large', np.nan, 'alone', 'small', 'large'],
    'Embarked': [np.nan, 'S', 'C', 'Q', 'C', 'Q', 'S'],
    'Survived': [0,1,1,0,0,1,0]})

X = df.drop("Survived", axis=1)
y = df["Survived"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

preprocessors = make_column_transformer(
    (OrdinalEncoder(), ['SibSp_category', 'Parch_category', 'Embarked']),
    (SimpleImputer(strategy='most_frequent'), ['SibSp_category', 'Parch_category', 'Embarked']),
    (OneHotEncoder(handle_unknown='ignore', drop='first'), ['SibSp_category', 'Parch_category', 'Embarked'])
)

pipelines = {
    'xgb': {
        'model': Pipeline([('preprocessor', preprocessors),
                           ('classifier', xgb.XGBClassifier())]),
        'params': {
            'classifier__learning_rate': [0.01, 0.1],
            'classifier__max_depth': [3, 5, 7, 9],
            'classifier__n_estimators': [100, 200]
        }
}
}

optimizer = BayesSearchCV(pipelines['xgb']['model'], pipelines['xgb']['params'],
                          n_jobs=-1, cv=2, scoring='accuracy', n_iter=20)
optimizer.fit(X_train, y_train)

...
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

ValueError: could not convert string to float: 'alone'

Solution

  • This is not what ColumnTransformer() does. It applies each of those three transformers to the supplied columns in parallel and concatenates the results, so the raw strings coming out of the SimpleImputer branch still reach the classifier (a quick check after the corrected snippet below shows this).

    If you wish to run imputation and encoding in sequence, SimpleImputer() and OneHotEncoder() should be separate steps of a single Pipeline, wrapped into ColumnTransformer() if needed, e.g.:

    preprocessors = make_column_transformer(
        (Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore', drop='first')),
        ]), ['SibSp_category', 'Parch_category', 'Embarked'])
    )
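
    To see the behaviour described above, you can inspect what the original three-branch ColumnTransformer actually emits. This is only a quick check that reuses X_train and the column list from the question, and it assumes a scikit-learn release recent enough (roughly 1.1 or later) for OrdinalEncoder and OneHotEncoder to tolerate the NaN values, which the posted traceback suggests is the case:

    from sklearn.compose import make_column_transformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    cols = ['SibSp_category', 'Parch_category', 'Embarked']
    broken = make_column_transformer(
        (OrdinalEncoder(), cols),                                      # branch 1: integer codes
        (SimpleImputer(strategy='most_frequent'), cols),               # branch 2: imputed raw strings
        (OneHotEncoder(handle_unknown='ignore', drop='first'), cols),  # branch 3: dummy columns
    )

    X_enc = broken.fit_transform(X_train)   # X_train from the question's snippet
    print(X_enc[:1])   # the concatenated row still holds strings such as 'alone'
                       # (from the imputer branch), which XGBoost cannot cast to float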
    

    With recent scikit-learn releases, pre-encoding with OrdinalEncoder() is unnecessary, since both SimpleImputer() and OneHotEncoder() handle string categories directly.
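
    Putting it all together, here is a minimal, self-contained sketch of the corrected setup. The column names and toy data are copied from the question; the fixed random_state is added only to make the sketch reproducible, and it assumes scikit-learn ≥ 1.0 so that drop='first' can be combined with handle_unknown='ignore':

    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.compose import make_column_transformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({
        'SibSp_category': ['alone', 'couple', 'group', 'alone', 'couple', 'group', np.nan],
        'Parch_category': ['alone', 'small', 'large', np.nan, 'alone', 'small', 'large'],
        'Embarked': [np.nan, 'S', 'C', 'Q', 'C', 'Q', 'S'],
        'Survived': [0, 1, 1, 0, 0, 1, 0],
    })
    X = df.drop('Survived', axis=1)
    y = df['Survived']
    # random_state added only to make this sketch reproducible
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

    categorical_cols = ['SibSp_category', 'Parch_category', 'Embarked']

    # Imputation and encoding run sequentially inside one Pipeline;
    # the ColumnTransformer only decides which columns that Pipeline sees.
    preprocessor = make_column_transformer(
        (Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore', drop='first')),
        ]), categorical_cols)
    )

    model = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier()),
    ])

    model.fit(X_train, y_train)   # no ValueError: only numeric dummy columns reach XGBoost
    print(model.predict(X_valid))

    Wired back into the pipelines dictionary from the question, this model works with the BayesSearchCV call unchanged; just note that the seven-row toy frame is too small for a meaningful cv=2 search, so run the optimization on the full dataset.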