GridSearchCV returns an error: Check the list of available parameters with `estimator.get_params().keys()`

I'm trying to build a Voting Ensemble model, with a data transformation pipeline. I still need to put the transformation of the response variable into the pipeline. I'm trying to use GridSearchCV to evaluate the best parameters for each algorithm, but when I try to run the last code block, I get an error.

dummy = pd.get_dummies(df['over_30'])
df = pd.concat((df, dummy), axis = 1)
df = df.drop(['over_30','N'], axis = 1)
df = df.rename(columns = {'Y':'over_30'})

X,y = df.drop(['over_30'], axis = 1), df[['over_30']]

categorical = ['business_sector', "state"]
numerical = ['valor_contrato', 'prazo', 'num_avalistas', 'annual_revenue',
             'risk', 'carteira_vencer_curto_prazo', 'carteira_vencer_longo_prazo',
             'risk_fintech_fidc', 'risk_pos_money', 'alavancagem_rate', 'patrimonio_socios',
             'target_amount', 'score', 'pib', 'company_month', 'week_month']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

variable_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical),
        ('categorical', categorical_transformer, categorical)],
         remainder='passthrough')

classifiers = [
    XGBClassifier(),
    LGBMClassifier(),
    RandomForestClassifier()
]

xgbclassifier_parameters = {
    'classifier__eta' : [0.001,0.3],
    'classifier__gamma' : [0],
    'classifier__max_depth' : [3, 7],
    'classifier__grow_policy' : ['lossguide', 'deptwise'],
    'classifier__objective' : ['reg:logistic'],
    'classifier__reg_lambda' : [1.25, 1],
    'classifier__subsample' : [0.5, 0.6, 0.7],
    'classifier__tree_method' : ['auto', 'hist'],
    'classifier__colsample_bytree' : [0.7, 0.8, 0.9, 1.0],
    'classifier__max_leaves' : [0, 7]
}

randomforest_paramenters = {
    'classifier__n_estimators': [200, 500],
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
}

lightgbm_parameters = {
    'classifier__num_leaves': [31, 127],
    'classifier__reg_alpha': [0.1, 0.5],
    'classifier__min_data_in_leaf': [30, 50, 100, 300, 400],
    'classifier__lambda_l1': [0, 1, 1.5],
    'classifier__lambda_l2': [0, 1]
}

parameters = [
    xgbclassifier_parameters,
    randomforest_paramenters,
    lightgbm_parameters
]

estimators = []

# iterate through each classifier and use GridSearchCV
for i, classifier in enumerate(classifiers):
    # create a Pipeline object
    pipe = Pipeline(steps=[
        ('transformer', variable_transformer),
        ('classifier', classifier)
    ])
    clf = GridSearchCV(pipe,
                       param_grid=parameters[i],
                       scoring=['f1_weighted',
                                'f1_macro',
                                'recall',
                                'roc_auc',
                                'precision'],
                       refit='recall',
                       cv=8)
    clf.fit(X, y)
    print("Tuned Hyperparameters :", clf.best_params_)
    print("Recall:", clf.best_score_)
    # add the clf to the estimators list
    estimators.append((classifier.__class__.__name__, clf))

But when I run this last cell, I get this error:

min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.

Can someone help me?

Solution

Please always post the full stack trace of the error so that people can understand it.

There are multiple mistakes in your code:

You create a Pipeline object from variable_transformer, but where do you fit it?
What are X and y?
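
The error message itself suggests inspecting `estimator.get_params().keys()`. As a minimal sketch (the helper name `unknown_grid_keys` is hypothetical, and only two keys from the question's grids are reproduced), each grid could be checked against the pipeline it will be searched with. Note that in the question `classifiers` is ordered XGBClassifier, LGBMClassifier, RandomForestClassifier, while `parameters` is ordered XGBoost, random forest, LightGBM, so index `i` pairs RandomForestClassifier with `lightgbm_parameters`; that mismatch may be what produces the error naming RandomForestClassifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


def unknown_grid_keys(estimator, param_grid):
    """Return the grid keys that the estimator does not expose as parameters.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    valid = set(estimator.get_params(deep=True).keys())
    return sorted(key for key in param_grid if key not in valid)


# Same step name as in the question, so parameters are addressed
# as 'classifier__<name>'.
pipe = Pipeline(steps=[("classifier", RandomForestClassifier())])

# One key copied from each of two grids in the question.
random_forest_grid = {"classifier__n_estimators": [200, 500]}
lightgbm_grid = {"classifier__num_leaves": [31, 127]}

rf_unknown = unknown_grid_keys(pipe, random_forest_grid)
lgbm_unknown = unknown_grid_keys(pipe, lightgbm_grid)

print("random forest grid, unknown keys:", rf_unknown)
print("lightgbm grid, unknown keys:", lgbm_unknown)
```

An empty list means every key in the grid is a parameter the pipeline knows about; any key that is listed would make GridSearchCV raise the "Invalid parameter" error for that estimator.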

Solution:

Separate X (the input features needed for training) from y (the output variable values the model has to learn).
Create the pipeline object. It is a wrapper that does the preprocessing for you, so fit it first before giving the input features to the model.
After fitting the pipeline object, pass the resulting NumPy array to the classifier as X, together with the corresponding y, to fit the model.

Here is an example with regressors, using data that I had handy.

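import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

# median_house_value is what I am trying to estimate.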
# input_features = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income']

df = pd.read_csv("/filepath/california_housing_train.csv")
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

categorical = []
numerical = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
         'total_bedrooms', 'population', 'households',
         'median_income']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

variable_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical),
        ('categorical', categorical_transformer, categorical)],
    remainder='passthrough')

regressors = [XGBRegressor(), RandomForestRegressor()]

# The 'regressor__' prefix addresses a Pipeline step named 'regressor'; with a
# bare estimator, scikit-learn's own estimators expect plain names such as 'max_depth'.
xgbregressor_parameters = {
    'regressor__grow_policy': ['lossguide', 'depthwise'],
    'regressor__objective': ['reg:squarederror'],
    'regressor__colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'regressor__max_leaves': [0, 7]}

randomforest_parameters = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8]}

parameters = [xgbregressor_parameters, randomforest_parameters]

estimators = []

pipe = Pipeline(steps=[
    ('transformer', variable_transformer)])

# fit the pipeline with input features for preprocessing
prepared_data = pipe.fit_transform(X)

# iterate through each regressor and use GridSearchCV
for i, regressor in enumerate(regressors):
    clf = GridSearchCV(regressor,
                       param_grid=parameters[i],
                       scoring=['neg_mean_squared_error',
                                'r2',
                                'explained_variance'],
                       refit='neg_mean_squared_error',
                       cv=2)
    # fit and report inside the loop so every regressor is searched
    clf.fit(prepared_data, y)
    print("Tuned Hyperparameters :", clf.best_params_)

    # add the clf to the estimators list
    estimators.append((regressor.__class__.__name__, clf))

# Output:
 
Tuned Hyperparameters : {'regressor__colsample_bytree': 0.7, 'regressor__grow_policy': 'lossguide', 'regressor__max_leaves': 0, 'regressor__objective': 'reg:squarederror'}
Tuned Hyperparameters : {'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}

Note: In your case, if you pass RandomForestClassifier to GridSearchCV directly rather than as a pipeline step, delete the classifier__ prefix from its parameter names.
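
To illustrate how the prefix relates to the step name, here is a small sketch on synthetic data (the data from `make_classification`, the variable names, and the tiny grids are assumptions chosen only to keep it quick to run). A bare estimator takes plain parameter names, while an estimator wrapped in a Pipeline step named `classifier` takes `classifier__<name>`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so the example is self-contained.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# Case 1: the classifier is passed to GridSearchCV directly,
# so the grid uses plain parameter names.
bare_search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=2,
)
bare_search.fit(X_demo, y_demo)

# Case 2: the classifier is a pipeline step named 'classifier',
# so each key is '<step name>__<parameter name>'.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe_search = GridSearchCV(
    pipe,
    param_grid={"classifier__max_depth": [2, 4]},
    cv=2,
)
pipe_search.fit(X_demo, y_demo)

print("bare estimator keys:", sorted(bare_search.best_params_))
print("pipeline keys:", sorted(pipe_search.best_params_))
```

In the second case the scaler is refit on each training fold inside the search, so no separate `fit_transform` call is needed beforehand.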