I am trying to perform a grid search in scikit-learn for a specific algorithm with different hyperparameters over multiple training datasets stored in a dedicated dictionary. First, I define the hyperparameters and the model to be used:
scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
grid_search = {}
for key in X_train_d.keys():
    cv = StratifiedKFold(n_splits=5, random_state=1)
    model = XGBClassifier(objective="binary:logistic", random_state=42)
    space = dict()
    space['n_estimators']=[50] # 200
    space['learning_rate']= [0.5] #0.01, 0.3, 0.5
    grid_search= GridSearchCV(model, space, scoring=scoring, cv=cv, n_jobs=3, verbose=2, refit='balanced_accuracy')
Then, I create an empty dictionary that should be populated with one GridSearchCV object per key in X_train_d, via:
grid_result = {}
for key in X_train_d.keys():
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])
Finally, I create one data frame per key, reporting the scoring info etc., via:
df_grid_results = {}
for key in X_train_d.keys():
    df_grid_results[key]=pd.DataFrame(grid_search.cv_results_)
    df_grid_results[key] = (
        df_grid_results[key]
        .set_index(df_grid_results[key]["params"].apply(
            lambda x: "_".join(str(val) for val in x.values()))
        )
        .rename_axis('kernel')
    )
All is working "perfectly" - in the sense that no error is shown - except that when I inspect either the different GridSearchCV objects or the df_grid_results datasets, I see that results are all identical as if the models were fit on the same dataset over and over again, while the X_train_d and Y_train_d dictionaries contain different datasets.
Of course, when I fit a model individually, like:
model1_cv = grid_search.fit(X_train_d[1], Y_train_d[1])
model2_cv = grid_search.fit(X_train_d[2], Y_train_d[2])
results differ as expected.
I feel like I am missing something really stupid and obvious here. Can anybody help? Thanks!
As pointed out by Malo, the problem is that every pass of the last loop reads grid_search.cv_results_, which by then holds only the results of the most recent fit, so you are copying the grid search results for the last dataset into all data frames. However, the multiple loops in your code are not really needed: you can run a single loop and save the results directly in one data frame, as follows:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
# features datasets
X_train_d = {
    'd1': np.random.normal(0, 1, (100, 3)),
    'd2': np.random.normal(0, 1, (100, 5))
}
# labels datasets
Y_train_d = {
    'd1': np.random.choice([0, 1], 100),
    'd2': np.random.choice([0, 1], 100)
}
# parameter grid
param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.3, 0.5]}
# evaluation metrics
scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
# cross-validation splits
cv = StratifiedKFold(n_splits=5)
# per-dataset results, collected in a list and concatenated after the loop
frames = []
for key in X_train_d.keys():
    # run the grid search (a new GridSearchCV object for each dataset)
    grid_search = GridSearchCV(
        estimator=XGBClassifier(objective='binary:logistic', random_state=42),
        param_grid=param_grid,
        scoring=scoring,
        cv=cv,
        n_jobs=3,
        verbose=2,
        refit='balanced_accuracy'
    )
    grid_search.fit(X_train_d[key], Y_train_d[key])
    # save the grid search results, tagged with the dataset key
    df_temp = pd.DataFrame(grid_search.cv_results_)
    df_temp['dataset'] = key
    frames.append(df_temp)
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df_grid_results = pd.concat(frames, ignore_index=True)
# index each row by its parameter values, e.g. '0.3_50'
df_grid_results = (
    df_grid_results
    .set_index(df_grid_results['params'].apply(
        lambda x: '_'.join(str(val) for val in x.values())))
    .rename_axis('kernel')
)
print(df_grid_results[[
    'dataset', 'mean_test_accuracy', 'mean_test_balanced_accuracy',
    'mean_test_f1', 'mean_test_precision', 'mean_test_recall'
]])
#          dataset  mean_test_accuracy  mean_test_balanced_accuracy  mean_test_f1  mean_test_precision  mean_test_recall
# kernel
# 0.3_50        d1                0.40                     0.403232      0.349067             0.399953          0.335556
# 0.3_100       d1                0.38                     0.382323      0.356022             0.368983          0.355556
# 0.5_50        d1                0.43                     0.429596      0.351857             0.391209          0.335556
# 0.5_100       d1                0.41                     0.409596      0.342767             0.365812          0.335556
# 0.3_50        d2                0.55                     0.540025      0.448419             0.501948          0.436111
# 0.3_100       d2                0.57                     0.556692      0.462381             0.515996          0.436111
# 0.5_50        d2                0.62                     0.607449      0.536695             0.587857          0.502778
# 0.5_100       d2                0.64                     0.629672      0.571682             0.607857          0.547222
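If you would rather keep one fitted search and one data frame per key, as in the question, here is a minimal sketch of why the original dictionaries ended up identical and one possible way around it. It relies on the fact that fit returns the GridSearchCV object itself, so storing the return value of the same object under several keys stores the same object several times. Note the assumptions: DecisionTreeClassifier and the max_depth grid are lightweight stand-ins for XGBClassifier and its parameters, the data are random toy arrays, and the names template and aliased are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# toy stand-ins for X_train_d / Y_train_d (assumption: random data)
X_train_d = {
    'd1': rng.normal(0, 1, (100, 3)),
    'd2': rng.normal(0, 1, (100, 5)),
}
Y_train_d = {
    'd1': rng.integers(0, 2, 100),
    'd2': rng.integers(0, 2, 100),
}

# unfitted search used as a template; DecisionTreeClassifier is only a
# lightweight stand-in for XGBClassifier
template = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [2, 4]},
    scoring=['accuracy', 'balanced_accuracy'],
    cv=StratifiedKFold(n_splits=5),
    refit='balanced_accuracy',
)

# the pitfall: fit() returns the object itself, so every key points to
# one and the same search, which holds only the last fit
aliased = {
    key: template.fit(X_train_d[key], Y_train_d[key]) for key in X_train_d
}
same_object = aliased['d1'] is aliased['d2']

# one way around it: clone the template so each key gets its own search
grid_result = {
    key: clone(template).fit(X_train_d[key], Y_train_d[key])
    for key in X_train_d
}

# one results data frame per key, each built from its own search
df_grid_results = {
    key: pd.DataFrame(search.cv_results_)
    for key, search in grid_result.items()
}
distinct_objects = grid_result['d1'] is not grid_result['d2']
print(same_object, distinct_objects)
```

Creating a new GridSearchCV inside the loop, as in the code above, has the same effect as clone; clone is just convenient when the search is already configured in one place.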