I'm trying to build a Voting Ensemble model, with a data transformation pipeline. I still need to put the transformation of the response variable into the pipeline. I'm trying to use GridSearchCV to evaluate the best parameters for each algorithm, but when I try to run the last code block, I get an error.
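import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

dummy = pd.get_dummies(df['over_30'])
df = pd.concat((df, dummy), axis=1)
df = df.drop(['over_30', 'N'], axis=1)
df = df.rename(columns={'Y': 'over_30'})
X, y = df.drop(['over_30'], axis=1), df[['over_30']]
categorical = ['business_sector', 'state']
numerical = ['valor_contrato', 'prazo', 'num_avalistas', 'annual_revenue',
             'risk', 'carteira_vencer_curto_prazo', 'carteira_vencer_longo_prazo',
             'risk_fintech_fidc', 'risk_pos_money', 'alavancagem_rate', 'patrimonio_socios',
             'target_amount', 'score', 'pib', 'company_month', 'week_month']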
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
variable_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical),
        ('categorical', categorical_transformer, categorical)],
    remainder='passthrough')
classifiers = [
    XGBClassifier(),
    LGBMClassifier(),
    RandomForestClassifier()
]
xgbclassifier_parameters = {
    'classifier__eta': [0.001, 0.3],
    'classifier__gamma': [0],
    'classifier__max_depth': [3, 7],
    'classifier__grow_policy': ['lossguide', 'depthwise'],
    'classifier__objective': ['reg:logistic'],
    'classifier__reg_lambda': [1.25, 1],
    'classifier__subsample': [0.5, 0.6, 0.7],
    'classifier__tree_method': ['auto', 'hist'],
    'classifier__colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'classifier__max_leaves': [0, 7]
}
randomforest_paramenters = {
    'classifier__n_estimators': [200, 500],
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
}
lightgbm_parameters = {
    'classifier__num_leaves': [31, 127],
    'classifier__reg_alpha': [0.1, 0.5],
    'classifier__min_data_in_leaf': [30, 50, 100, 300, 400],
    'classifier__lambda_l1': [0, 1, 1.5],
    'classifier__lambda_l2': [0, 1]
}
parameters = [
    xgbclassifier_parameters,
    randomforest_paramenters,
    lightgbm_parameters
]
estimators = []
# iterate through each classifier and use GridSearchCV
for i, classifier in enumerate(classifiers):
    # create a Pipeline object
    pipe = Pipeline(steps=[
        ('transformer', variable_transformer),
        ('classifier', classifier)
    ])
    clf = GridSearchCV(pipe,
                       param_grid=parameters[i],
                       scoring=['f1_weighted',
                                'f1_macro',
                                'recall',
                                'roc_auc',
                                'precision'],
                       refit='recall',
                       cv=8)
    clf.fit(X, y)
    print("Tuned Hyperparameters :", clf.best_params_)
    print("Recall:", clf.best_score_)
    # add the clf to the estimators list
    estimators.append((classifier.__class__.__name__, clf))
But when I run this last cell, I get this error:
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
Can someone help me?
Please always post the full stack trace of the error so that people can understand the problem.
There are multiple mistakes in your code:
1. `variable_transformer`: where are you fitting it?
2. What are `X` and `y`?
Solution:
1. `X` is the set of input features needed for training, and `y` holds the output variable values that the model has to learn.
2. You need `X` and the corresponding `y` to fit the model/classifier.
I am showing an example of regressors with the data that I had handy.
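import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

# median_house_value is what I am trying to estimate.
# input_features = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income']
df = pd.read_csv("/filepath/california_housing_train.csv")
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]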
categorical = []
numerical = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
             'total_bedrooms', 'population', 'households',
             'median_income']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
variable_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical),
        ('categorical', categorical_transformer, categorical)],
    remainder='passthrough')
regressors = [XGBRegressor(), RandomForestRegressor()]
xgbregressor_parameters = {
    'regressor__grow_policy': ['lossguide', 'depthwise'],
    'regressor__objective': ['reg:squarederror'],
    'regressor__colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'regressor__max_leaves': [0, 7]}
randomforest_parameters = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8]}
parameters = [xgbregressor_parameters, randomforest_parameters]
estimators = []
pipe = Pipeline(steps=[
    ('transformer', variable_transformer)])
# fit the pipeline with input features for preprocessing
prepared_data = pipe.fit_transform(X)
# iterate through each regressor and use GridSearchCV
for i, regressor in enumerate(regressors):
    clf = GridSearchCV(regressor,
                       param_grid=parameters[i],
                       scoring=['neg_mean_squared_error',
                                'r2',
                                'explained_variance',
                                ],
                       refit='neg_mean_squared_error',
                       cv=2)
    clf.fit(prepared_data, y)
    print("Tuned Hyperparameters :", clf.best_params_)
    # add the clf to the estimators list
    estimators.append((regressor.__class__.__name__, clf))
# Output:
Tuned Hyperparameters : {'regressor__colsample_bytree': 0.7, 'regressor__grow_policy': 'lossguide', 'regressor__max_leaves': 0, 'regressor__objective': 'reg:squarederror'}
Tuned Hyperparameters : {'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
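The example above transforms all of X once and then searches over the bare regressors. A possible alternative, closer to the setup in the question, is to keep the transformer and the estimator in one Pipeline and pass that to GridSearchCV, so the preprocessing is refit inside every fold. The sketch below is only an illustration: the data is synthetic, the grid is a small hypothetical one, and the step name `regressor` is chosen for the example. The grid keys must carry that step name as a prefix.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: two numeric columns, not the housing file).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
# Columns are addressed by position because X is a plain NumPy array.
variable_transformer = ColumnTransformer(
    transformers=[('numeric', numeric_transformer, [0, 1])])

# Transformer and estimator live in the same Pipeline.
pipe = Pipeline(steps=[
    ('transformer', variable_transformer),
    ('regressor', RandomForestRegressor(random_state=0))])

# Keys are '<step name>__<parameter name>'.
param_grid = {
    'regressor__n_estimators': [10, 20],
    'regressor__max_depth': [2, 4]}

search = GridSearchCV(pipe, param_grid=param_grid,
                      scoring='neg_mean_squared_error', cv=2)
search.fit(X, y)
print("Tuned Hyperparameters :", search.best_params_)
```

With this layout the scaler and imputer statistics are computed only from the training part of each fold, at the cost of refitting the transformer for every candidate.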
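Note: In your case, delete the `classifier__` prefix prepended to the parameter names for RandomForestClassifier if you pass the classifier directly to GridSearchCV as in the example above. The prefix is only valid when the estimator is a Pipeline step named `classifier`.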
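The error message in the question suggests checking `estimator.get_params().keys()`. A small sketch of that check follows; the helper name `unknown_grid_keys` and the shortened grids are hypothetical, made up for this illustration. It lists the grid keys an estimator would reject before any search is run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def unknown_grid_keys(estimator, param_grid):
    """Return the grid keys that the estimator does not accept (hypothetical helper)."""
    valid = set(estimator.get_params().keys())
    return sorted(key for key in param_grid if key not in valid)


# Random forest as a Pipeline step named 'classifier', as in the question.
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier()),
])

# Shortened versions of the grids from the question.
rf_grid = {'classifier__n_estimators': [200, 500],
           'classifier__max_depth': [4, 5]}
lgbm_grid = {'classifier__num_leaves': [31, 127],
             'classifier__reg_alpha': [0.1, 0.5]}

# Random forest grid against the pipeline: every key is accepted.
print(unknown_grid_keys(pipe, rf_grid))
# LightGBM grid against a random forest pipeline: every key is rejected.
print(unknown_grid_keys(pipe, lgbm_grid))
# Prefixed grid against the bare classifier: every key is rejected.
print(unknown_grid_keys(RandomForestClassifier(), rf_grid))
```

It may also be worth checking the pairing in the question: `classifiers` lists LGBMClassifier second and RandomForestClassifier third, while `parameters` lists the random forest grid second and the LightGBM grid third, so `parameters[i]` would hand the LightGBM grid to the random forest. That could explain an invalid-parameter error that names RandomForestClassifier.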