python scikit-learn classification grid-search gridsearchcv

Iterating GridSearchCV over multiple datasets gives identical results for each


I am trying to perform a grid search in Scikit-learn for a specific algorithm with different hyperparameters over multiple training datasets stored in a dedicated dictionary. First, I define the hyperparameters and the model to be used:

scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
grid_search = {}

for key in X_train_d.keys():
    cv = StratifiedKFold(n_splits=5, random_state=1)
    model = XGBClassifier(objective="binary:logistic", random_state=42)
    space = dict()
    space['n_estimators']=[50] # 200
    space['learning_rate']= [0.5] #0.01, 0.3, 0.5
    grid_search= GridSearchCV(model, space, scoring=scoring, cv=cv, n_jobs=3, verbose=2, refit='balanced_accuracy')

Then, I create an empty dictionary that should be populated with as many GridSearchCV objects as X_train_d.keys(), via:

grid_result = {}    
for key in X_train_d.keys():
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])

Finally, I create one data frame per key, reporting the scoring info etc., via:

df_grid_results = {}
for key in X_train_d.keys():
    df_grid_results[key]=pd.DataFrame(grid_search.cv_results_)
    df_grid_results[key] = (
    df_grid_results[key]
    .set_index(df_grid_results[key]["params"].apply(
        lambda x: "_".join(str(val) for val in x.values()))
    )
    .rename_axis('kernel')
    )

All is working "perfectly" - in the sense that no error is shown - except that when I inspect either the different GridSearchCV objects or the df_grid_results datasets, I see that results are all identical as if the models were fit on the same dataset over and over again, while the X_train_d and Y_train_d dictionaries contain different datasets.

Of course, when I fit a model individually, like:

model1_cv = grid_search.fit(X_train_d[1], Y_train_d[1])
model2_cv = grid_search.fit(X_train_d[2], Y_train_d[2])

results differ as expected.

I feel like I am missing something really stupid and obvious here. Can anybody help? Thanks!


Solution

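  • As Malo pointed out, the problem is in the last loop: grid_search.cv_results_ always reflects the most recent fit, so every data frame is built from the results for the last dataset. The same applies to grid_result: fit returns the GridSearchCV object itself, so all of its values refer to one and the same object. The separate loops are not really needed, however. You can run a single loop, create a new GridSearchCV for each dataset, and save the results directly in one data frame as follows: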

    import numpy as np
    import pandas as pd
    from xgboost import XGBClassifier
    from sklearn.model_selection import StratifiedKFold, GridSearchCV
    
    # features datasets
    X_train_d = {
        'd1': np.random.normal(0, 1, (100, 3)), 
        'd2': np.random.normal(0, 1, (100, 5))
    }
    
    # labels datasets
    Y_train_d = {
        'd1': np.random.choice([0, 1], 100), 
        'd2': np.random.choice([0, 1], 100)
    }
    
    # parameter grid
    param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.3, 0.5]}
    
    # evaluation metrics
    scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
    
    # cross-validation splits
    cv = StratifiedKFold(n_splits=5)
    
    # results data frame
    df_grid_results = pd.DataFrame()
    
    for key in X_train_d.keys():
    
        # run the grid search
        grid_search = GridSearchCV(
            estimator=XGBClassifier(objective='binary:logistic', random_state=42), 
            param_grid=param_grid, 
            scoring=scoring, 
            cv=cv, 
            n_jobs=3, 
            verbose=2, 
            refit='balanced_accuracy'
        )
        
        grid_search.fit(X_train_d[key], Y_train_d[key])
        
        # save the grid search results in the data frame
        df_temp = pd.DataFrame(grid_search.cv_results_)
        df_temp['dataset'] = key
        
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        df_grid_results = pd.concat([df_grid_results, df_temp], ignore_index=True)
    
    # index each row by its parameter values, e.g. '0.3_50'
    df_grid_results = df_grid_results.set_index(
        df_grid_results['params'].apply(lambda x: '_'.join(str(val) for val in x.values()))
    ).rename_axis('kernel')
    
    print(df_grid_results[['dataset', 'mean_test_accuracy', 'mean_test_balanced_accuracy', 'mean_test_f1', 'mean_test_precision', 'mean_test_recall']])
    #         dataset  mean_test_accuracy  mean_test_balanced_accuracy  mean_test_f1  mean_test_precision  mean_test_recall  
    # kernel                                                             
    # 0.3_50       d1                0.40                     0.403232      0.349067             0.399953          0.335556  
    # 0.3_100      d1                0.38                     0.382323      0.356022             0.368983          0.355556  
    # 0.5_50       d1                0.43                     0.429596      0.351857             0.391209          0.335556  
    # 0.5_100      d1                0.41                     0.409596      0.342767             0.365812          0.335556  
    # 0.3_50       d2                0.55                     0.540025      0.448419             0.501948          0.436111
    # 0.3_100      d2                0.57                     0.556692      0.462381             0.515996          0.436111  
    # 0.5_50       d2                0.62                     0.607449      0.536695             0.587857          0.502778  
    # 0.5_100      d2                0.64                     0.629672      0.571682             0.607857          0.547222
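
If you would rather keep one fitted GridSearchCV object per dataset, as the grid_result dictionary in the question intended, the following minimal sketch may help. It first reproduces the aliasing (fit returns the object it was called on) and then avoids it with sklearn.base.clone. The toy data, the DecisionTreeClassifier and its max_depth grid are assumptions chosen only to keep the example small and free of the xgboost dependency; they are not part of the original code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy datasets standing in for X_train_d / Y_train_d
X_train_d = {"d1": rng.normal(size=(60, 3)), "d2": rng.normal(size=(60, 5))}
Y_train_d = {"d1": np.repeat([0, 1], 30), "d2": np.repeat([0, 1], 30)}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid={"max_depth": [1, 2]},
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=3),
)

# fit() returns the object itself, so every dictionary value is the same
# object, and it only holds the results of the last fit
aliased = {key: search.fit(X_train_d[key], Y_train_d[key]) for key in X_train_d}
same_object = aliased["d1"] is aliased["d2"]

# clone() builds an unfitted copy with the same settings, which gives
# one independently fitted search per dataset
separate = {
    key: clone(search).fit(X_train_d[key], Y_train_d[key]) for key in X_train_d
}
n_features = {key: gs.n_features_in_ for key, gs in separate.items()}

print(same_object)  # True
print(n_features)   # {'d1': 3, 'd2': 5}
```

Each entry of separate then has its own cv_results_, so pd.DataFrame(separate[key].cv_results_) gives a different data frame per key.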