I am trying to fit a model on a dataset with both categorical and numerical variables. I one-hot encode the categorical features inside a pipeline that is passed to GridSearchCV. The error occurs on the last line, when I try to fit the model. My understanding is that the data is not being run through the pipeline before the model is fitted, because the TypeError refers to the column names from BEFORE encoding. What is the correct process?
The error:
TypeError: '['First' 'Second' 'Third']' is an invalid key
My code:
y = sample.iloc[:, -1:]
X = sample.iloc[:, :-1]
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.90, random_state=2, shuffle=True
)
categorical_columns = [
"first",
"second",
"third"]
numerical_columns = [
"fourth",
"fifth",
"sixth"
]
categorical_encoder = preprocessing.OneHotEncoder()
preprocessing = ColumnTransformer(
[('cat', categorical_encoder, enc_sample[categorical_columns].values.reshape(-1, 3)),
('num', 'passthrough', enc_sample[numerical_columns])])
pipe = Pipeline([
('preprocess', preprocessing),
('classifier', GradientBoostingRegressor())
])
cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=3)
search_grid = {
"classifier__n_estimators": [100],
"classifier__learning_rate": [0.1],
"classifier__max_depth": [5],
"classifier__min_samples_leaf":[8],
"classifier__subsample":[0.6]
}
search = GridSearchCV(
estimator=pipe, param_grid=search_grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, return_train_score=True
)
search.fit(X_train, y_train)
For reference, I followed the official documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
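Your ColumnTransformer is not selecting the categorical and numerical columns. The third element of each tuple must be a column specifier (column names, indices, a boolean mask, or a callable), not the data itself, and you are passing arrays of values there. You can fix that by using sklearn.compose.make_column_selector to select columns based on their dtype.
You can use it as follows: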
from sklearn.compose import make_column_selector
preprocessing = ColumnTransformer(
[('cat', categorical_encoder, make_column_selector(dtype_include=object)),
('num', 'passthrough', make_column_selector(dtype_exclude=object))])
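If you would rather keep your explicit column lists, passing the lists of names directly as the column specifier should also work. Here is a minimal end-to-end sketch of that variant. The toy DataFrame, the "target" column, the step name "regressor" and the small parameter grid are all made up for illustration, so substitute your own data and settings:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for `sample` (not the asker's dataset).
n = 60
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "first": ["a", "b"] * (n // 2),        # categorical, 2 levels
    "second": ["x", "y", "z"] * (n // 3),  # categorical, 3 levels
    "fourth": rng.normal(size=n),          # numerical
    "target": rng.normal(size=n),          # assumed name for the target column
})
X = sample.drop(columns="target")
y = sample["target"]  # 1-D target

categorical_columns = ["first", "second"]
numerical_columns = ["fourth"]

# Pass column NAMES, not the data: the transformer picks the columns
# out of whatever DataFrame it receives at fit/predict time.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ("num", "passthrough", numerical_columns),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("regressor", GradientBoostingRegressor(random_state=0)),
])

search = GridSearchCV(
    estimator=pipe,
    param_grid={"regressor__n_estimators": [20], "regressor__max_depth": [2]},
    scoring="neg_mean_absolute_error",
    cv=RepeatedKFold(n_splits=2, n_repeats=2, random_state=3),
)
search.fit(X, y)  # encoding is fitted inside each CV split

predictions = search.predict(X)
encoded = search.best_estimator_.named_steps["preprocess"].transform(X)
```

Because the encoder lives inside the pipeline, it is refitted on the training part of every fold. handle_unknown="ignore" is there so that a category missing from a training fold does not raise an error at prediction time.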