python, python-3.x, scikit-learn, warnings, error-logging

Saving sklearn warnings to a dataframe


I am using sklearn's GridSearchCV to optimize parameters for AdaBoost classifiers with different datasets. I then create/append to a DataFrame that has information like the dataset name, best_params_, and best_score_.

Sometimes I get warnings, such as a ConvergenceWarning or a deprecation warning from a package. They don't necessarily hurt anything, but I would like to add them as a column.

This post (Writing scikit-learn verbose log into an external file) seems to get close with bluesummers' and mbil's messages, but I don't really want to write a file just to read it back into my DataFrame.
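
For reference, a file-free version of that idea would be to capture the verbose output in memory, e.g. with contextlib.redirect_stdout/redirect_stderr. This is only a rough sketch: it assumes everything is printed from this process (no n_jobs=-1 workers), and it captures formatted text rather than the warning objects themselves, which is why I'm still looking for something better.

import contextlib
import io
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.random((50, 3))
y = ['foo', 'bar'] * 25

abc = AdaBoostClassifier(base_estimator=RandomForestClassifier(n_estimators=10))
parameters = {'n_estimators': [5, 10], 'learning_rate': [0.01, 0.2]}

# collect the verbose log in a string buffer instead of an external file
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer), contextlib.redirect_stderr(buffer):
    clf = GridSearchCV(abc, parameters, cv=5, scoring='accuracy', verbose=3)
    clf.fit(X, y)

log_text = buffer.getvalue()  # could be stored in a DataFrame column directly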

Here is a minimal working example. The DataFrame at the end currently fills NA for the "warning" column. However, because I'm using AdaBoostClassifier(base_estimator=RandomForestClassifier()) instead of AdaBoostClassifier(estimator=RandomForestClassifier()), I should be getting a bunch of warnings that I would like to grab and save in that column.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
import numpy as np
import tqdm as tq
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_params = pd.DataFrame(columns=['learning_rate', 'n_estimators', 'accuracy', 'warning'])
abc = AdaBoostClassifier(base_estimator=RandomForestClassifier())

parameters = {'n_estimators': [5, 10],
              'learning_rate': [0.01, 0.2]}

# three toy datasets of different shapes
a = np.random.random((50, 3))
b = np.random.random((70, 3))
c = np.random.random((50, 5))


for i, data in tq.tqdm(enumerate([a, b, c])):
    X = data
    sc = StandardScaler()
    X = sc.fit_transform(X)
    y = ['foo', 'bar'] * int(len(X) / 2)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)
    clf = GridSearchCV(abc, parameters, cv=skf, scoring='accuracy', n_jobs=-1)
    clf.fit(X, y)

    # keep the best parameters and score for this dataset
    dict_best_params = clf.best_params_.copy()
    dict_best_params['accuracy'] = clf.best_score_
    best_params = pd.DataFrame(dict_best_params, index=[i])
    df_params = pd.concat([df_params, best_params], ignore_index=False)

df_params.head()
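
As a quick sanity check that a warning really is raised at fit time here, one option is to temporarily escalate warnings to errors and see which one fires first. This is just a throwaway diagnostic (it assumes scikit-learn 1.2/1.3, where base_estimator is deprecated but still accepted):

import warnings
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X = np.random.random((20, 3))
y = ['foo', 'bar'] * 10

# escalate warnings to exceptions inside this block only, then inspect the first one
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        AdaBoostClassifier(base_estimator=RandomForestClassifier(n_estimators=5),
                           n_estimators=5).fit(X, y)
    except Warning as w:
        print(type(w).__name__, w)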

Solution

  • IIUC, you can use warnings.catch_warnings:

    import warnings  # HERE
    import numpy as np
    import tqdm as tq
    import pandas as pd
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.preprocessing import StandardScaler
    
    df_params = pd.DataFrame(columns=['learning_rate', 'n_estimators', 'accuracy', 'warning'])
    abc = AdaBoostClassifier(base_estimator=RandomForestClassifier())
    
    parameters = {'n_estimators': [5, 10],
                  'learning_rate': [0.01, 0.2]}
    
    a = np.random.random((50, 3))
    b = np.random.random((70, 3))
    c = np.random.random((50, 5))
    
    
    for i, data in tq.tqdm(enumerate([a, b, c])):
        with warnings.catch_warnings(record=True) as cx_manager:  # HERE
            X = data
            sc = StandardScaler()
            X = sc.fit_transform(X)
            y = ['foo', 'bar'] * int(len(X) / 2)
    
            skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)
            clf = GridSearchCV(abc, parameters, cv=skf, scoring='accuracy', n_jobs=-1)
            clf.fit(X, y)
    
            dict_best_params = clf.best_params_.copy()
            dict_best_params['accuracy'] = clf.best_score_
            dict_best_params['warning'] = [w.message for w in cx_manager]  # HERE
            best_params = pd.DataFrame(dict_best_params, index=[i])
            df_params = pd.concat([df_params, best_params], ignore_index=False)

    Output:

    >>> df_params
       learning_rate n_estimators  accuracy                                            warning
    0           0.20           10  0.520000  `base_estimator` was renamed to `estimator` in...
    1           0.20           10  0.514286  `base_estimator` was renamed to `estimator` in...
    2           0.01            5  0.440000  `base_estimator` was renamed to `estimator` in...
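
    Each item in cx_manager is a warnings.WarningMessage, so .message above is a Warning instance rather than a plain string, and the resulting list only lines up with the single-row index when exactly one warning was recorded. One possible tweak is to join everything into a single string so the 'warning' cell stays well-defined even with zero or several warnings; a toy demo of the idea (the warn calls are just placeholders):

    import warnings
    
    with warnings.catch_warnings(record=True) as cx_manager:
        warnings.simplefilter("always")  # record repeated warnings as well
        warnings.warn("first issue", UserWarning)
        warnings.warn("second issue", FutureWarning)
    
    # one string per row, or None when nothing was caught
    joined = "; ".join(str(w.message) for w in cx_manager) or None
    print(joined)   # first issue; second issue

    In the loop above you would assign that joined string to dict_best_params['warning'] instead of the list.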