Tags: python, scikit-learn, one-hot-encoding, boosting

scikit-learn Pipeline is not processed correctly with GridSearchCV


I am trying to fit a model on a dataset with categorical and numerical variables. I one-hot encode the categorical features and feed everything into a pipeline that is used in GridSearchCV. The error is raised on the last line, when I fit the search. My understanding is that the data does not go through the pipeline's preprocessing step before the model is fitted, because the TypeError refers to the column names BEFORE encoding. What is the correct way to set this up?

The error:

TypeError: '['First' 'Second' 'Third']' is an invalid key

My code:

y = sample.iloc[:, -1:]
X = sample.iloc[:, :-1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, random_state=2, shuffle=True
)

categorical_columns = [
    "first",
    "second",
    "third"]
numerical_columns = [
    "fourth",
    "thith", 
    "sixth"
]
categorical_encoder = preprocessing.OneHotEncoder()

preprocessing = ColumnTransformer(
    [('cat', categorical_encoder, enc_sample[categorical_columns].values.reshape(-1, 3)),
     ('num', 'passthrough', enc_sample[numerical_columns])])

pipe = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', GradientBoostingRegressor())
])

cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)

search_grid = {
    "classifier__n_estimators": [100],
    "classifier__learning_rate": [0.1],
    "classifier__max_depth": [5],
    "classifier__min_samples_leaf":[8],
    "classifier__subsample":[0.6]
}
search = GridSearchCV(
    estimator=pipe, param_grid=search_grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, return_train_score=True
)
search.fit(X_train, y_train)

For reference, I followed the official documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html


Solution

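  • It looks like your ColumnTransformer is not selecting the categorical and numerical columns. The third element of each transformer tuple should say which columns to use (column names, indices, a boolean mask, or a callable), but your code passes the data itself, which is the likely cause of the TypeError. One fix is to use sklearn.compose.make_column_selector to select columns by dtype. This assumes your categorical columns are stored with object dtype.

    You can use it as follows: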

    from sklearn.compose import ColumnTransformer, make_column_selector

    # Select columns by dtype instead of passing the data itself.
    preprocessing = ColumnTransformer(
        [('cat', categorical_encoder, make_column_selector(dtype_include=object)),
         ('num', 'passthrough', make_column_selector(dtype_exclude=object))])
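
If you would rather keep your explicit column lists, passing the lists of names directly as the third tuple element should also work. Below is a minimal, self-contained sketch of that variant. The data is synthetic and only stands in for your `sample` frame: the column values, the `target` column, the step name `regressor`, and the small parameter grid are all assumptions made for illustration, not taken from your dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the asker's data (hypothetical values).
rng = np.random.default_rng(0)
n = 80
sample = pd.DataFrame({
    "first": rng.choice(["a", "b"], size=n),
    "second": rng.choice(["x", "y", "z"], size=n),
    "fourth": rng.normal(size=n),
    "sixth": rng.normal(size=n),
})
sample["target"] = sample["fourth"] * 2.0 + rng.normal(scale=0.1, size=n)

X = sample.drop(columns="target")
y = sample["target"]  # 1-D target instead of a one-column DataFrame

categorical_columns = ["first", "second"]
numerical_columns = ["fourth", "sixth"]

# The third tuple element names the columns; it is not the data itself.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ("num", "passthrough", numerical_columns),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("regressor", GradientBoostingRegressor(random_state=0)),
])

search = GridSearchCV(
    estimator=pipe,
    param_grid={"regressor__n_estimators": [20], "regressor__max_depth": [2]},
    scoring="neg_mean_absolute_error",
    cv=RepeatedKFold(n_splits=2, n_repeats=2, random_state=3),
)
# Pass the raw DataFrame; the pipeline encodes it inside each CV split.
search.fit(X, y)
print(search.best_params_)
```

The sketch uses `handle_unknown="ignore"` so that a category missing from one training fold does not raise an error when it appears in the validation fold. Whether that is the right behaviour for your data is a judgment call.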