Tags: scikit-learn, cross-validation, xgboost, auc, optuna

Strange behavior of roc_auc_score, 'roc_auc', 'auc'


While optimizing parameters for xgboost, I encountered a problem with the roc_auc_score metric: I get significantly different AUC values during cross-validation than when I score the trained model on the training data.

import optuna
import xgboost as xgb


class OptunaHyperparamsSearch:
    def __init__(self, X_train, y_train, **kwargs):
        # ... stores X_train / y_train and builds self.dtrain (an xgb.DMatrix) from them
        ...

    def objective(self, trial):
        # ... builds the `param` dict from trial.suggest_* calls
        ...

        cv_results = xgb.cv(param, self.dtrain, num_boost_round=5,
                            metrics=['auc'], nfold=5, verbose_eval=True)

        mean_auc = cv_results['test-auc-mean'].max()
        boost_rounds = cv_results['test-auc-mean'].idxmax()  # 0-based index of the best round

        param['n_estimators'] = boost_rounds
        trial.set_user_attr('param', param)

        print('boost_rounds: ', boost_rounds)
        print('train-auc-mean', cv_results['train-auc-mean'][boost_rounds])

        return mean_auc

    def best_model(self, n_trials=100, save_path=None):
        study = optuna.create_study(direction="maximize")
        study.optimize(self.objective, n_trials=n_trials)

        best_params = study.best_trial.user_attrs['param']
        best_model = xgb.XGBClassifier(**best_params)
        best_model.fit(self.X_train, self.y_train)

        return best_model

After running this code:

search = OptunaHyperparamsSearch(X_train, y_train)
model = search.best_model(n_trials=1)

I received:

[0] train-auc:0.777869+0.00962852   test-auc:0.771169+0.025347
[1] train-auc:0.786905+0.00865646   test-auc:0.777492+0.0255523
[2] train-auc:0.793305+0.00480249   test-auc:0.785307+0.0198732
[3] train-auc:0.79595+0.00349561    test-auc:0.789897+0.0158569
[4] train-auc:0.796818+0.00407504   test-auc:0.789997+0.016069
boost_rounds:  4
train-auc-mean 0.796818
[I 2020-06-04 10:12:25,093] Finished trial#0 with value: 0.7899968 with parameters: {'booster': 'dart', 'reg_lambda': 0.8001057111479173, 'reg_alpha': 0.0016960618598770582, 'max_depth': 8, 'min_child_weight': 4, 'learning_rate': 0.0602235073221647, 'gamma': 0.0011248451567255984, 'colsample_bytree': 0.911487203002922, 'subsample': 0.9057485217255851, 'grow_policy': 'lossguide', 'scale_pos_weight': 0.5865962792358733, 'sample_type': 'weighted', 'normalize_type': 'tree', 'rate_drop': 0.0009459988874640169, 'skip_drop': 8.103200442539776e-05}. Best is trial#0 with value: 0.7899968.

So the result is about 0.8 (train-auc-mean 0.796818). But then, running:

y_pred = model.predict(X_train)
print(roc_auc_score(y_train, y_pred))

I received:

0.598231710442728

That seems impossible. I also tried a custom evaluation function:

import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score


def PyAUC(predt: np.ndarray, dtrain: xgb.DMatrix):
    # custom evaluation metric: compute AUC from the predictions and the DMatrix labels
    y = dtrain.get_label()
    return 'PyAUC', roc_auc_score(y, predt)

I passed it to xgb.cv via feval, set param['disable_default_eval_metric'] = 1, and did not define metrics; the result was the same.
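For reference, the call looked roughly like this (a sketch; param is the same dict built in objective, and with a custom metric xgb.cv names the result columns after it, here 'test-PyAUC-mean'):

param['disable_default_eval_metric'] = 1

cv_results = xgb.cv(param, self.dtrain, num_boost_round=5, nfold=5,
                    feval=PyAUC, maximize=True, verbose_eval=True)

mean_auc = cv_results['test-PyAUC-mean'].max()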

Then I tried to use RandomizedSearchCV:

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}

alg = XGBClassifier(learning_rate=0.01, n_estimators=5, objective='binary:logistic',
                    silent=True, nthread=1)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

random_search = RandomizedSearchCV(alg, param_distributions=params, n_iter=10,
                                   scoring='roc_auc', n_jobs=4,
                                   cv=skf.split(X_train, y_train), verbose=3,
                                   random_state=1001)

random_search.fit(X_train, y_train)

print('\n All results:')
print(random_search.cv_results_)

y_pred = random_search.predict(X_train)
print(roc_auc_score(y_train, y_pred))

The output was:

All results:
{'mean_fit_time': array([0.27621794, 0.40631523, 0.36202598, 0.32188687, 0.34574351,
   0.2747798 , 0.31780529, 0.32190156, 0.34060073, 0.25945067]), 'std_fit_time': array([0.02603387, 0.04572275, 0.09460844, 0.01841953, 0.08391794,
   0.03654419, 0.01583525, 0.03670047, 0.01035465, 0.03085039]), 'mean_score_time': array([0.01927972, 0.0143033 , 0.01697631, 0.01260743, 0.02442002,
   0.02089334, 0.0182806 , 0.0132216 , 0.01498265, 0.01320119]), 'std_score_time': array([0.00609847, 0.00671443, 0.00613005, 0.00410744, 0.00384849,
   0.00516041, 0.00505873, 0.00276774, 0.00023382, 0.00546102]), 'param_subsample': masked_array(data=[1.0, 0.6, 0.8, 1.0, 0.8, 1.0, 1.0, 0.8, 0.8, 0.8],
         mask=[False, False, False, False, False, False, False, False,
               False, False],
   fill_value='?',
        dtype=object), 'param_min_child_weight': masked_array(data=[5, 1, 5, 5, 1, 10, 1, 1, 1, 1],
         mask=[False, False, False, False, False, False, False, False,
               False, False],
   fill_value='?',
        dtype=object), 'param_max_depth': masked_array(data=[3, 5, 5, 5, 4, 4, 5, 3, 5, 4],
         mask=[False, False, False, False, False, False, False, False,
               False, False],
   fill_value='?',
        dtype=object), 'param_gamma': masked_array(data=[5, 1.5, 1, 5, 1, 1.5, 5, 2, 0.5, 1.5],
         mask=[False, False, False, False, False, False, False, False,
               False, False],
   fill_value='?',
        dtype=object), 'param_colsample_bytree': masked_array(data=[1.0, 0.8, 0.8, 0.6, 1.0, 0.6, 0.6, 0.8, 0.6, 0.6],
         mask=[False, False, False, False, False, False, False, False,
               False, False],
   fill_value='?',
        dtype=object), 'params': [{'subsample': 1.0, 'min_child_weight': 5, 'max_depth': 3, 'gamma': 5, 'colsample_bytree': 1.0}, {'subsample': 0.6, 'min_child_weight': 1, 'max_depth': 5, 'gamma': 1.5, 'colsample_bytree': 0.8}, {'subsample': 0.8, 'min_child_weight': 5, 'max_depth': 5, 'gamma': 1, 'colsample_bytree': 0.8}, {'subsample': 1.0, 'min_child_weight': 5, 'max_depth': 5, 'gamma': 5, 'colsample_bytree': 0.6}, {'subsample': 0.8, 'min_child_weight': 1, 'max_depth': 4, 'gamma': 1, 'colsample_bytree': 1.0}, {'subsample': 1.0, 'min_child_weight': 10, 'max_depth': 4, 'gamma': 1.5, 'colsample_bytree': 0.6}, {'subsample': 1.0, 'min_child_weight': 1, 'max_depth': 5, 'gamma': 5, 'colsample_bytree': 0.6}, {'subsample': 0.8, 'min_child_weight': 1, 'max_depth': 3, 'gamma': 2, 'colsample_bytree': 0.8}, {'subsample': 0.8, 'min_child_weight': 1, 'max_depth': 5, 'gamma': 0.5, 'colsample_bytree': 0.6}, {'subsample': 0.8, 'min_child_weight': 1, 'max_depth': 4, 'gamma': 1.5, 'colsample_bytree': 0.6}], 'split0_test_score': array([0.75734333, 0.78965043, 0.78929122, 0.77842559, 0.78669592,
   0.77856369, 0.7803955 , 0.77733652, 0.78884686, 0.77706318]), 'split1_test_score': array([0.7564997 , 0.78553601, 0.78621578, 0.77250155, 0.78589665,
   0.77237991, 0.77235486, 0.77187115, 0.78573708, 0.77046652]), 'split2_test_score': array([0.75575839, 0.77356843, 0.79002323, 0.77134164, 0.76641651,
   0.76965581, 0.77133806, 0.76749842, 0.79029943, 0.77043647]), 'split3_test_score': array([0.74596394, 0.77188117, 0.76967513, 0.76816388, 0.76832059,
   0.76795065, 0.76942182, 0.76217902, 0.76846871, 0.75720452]), 'split4_test_score': array([0.78099172, 0.80616938, 0.80491224, 0.80371433, 0.81990511,
   0.82052725, 0.80327483, 0.80598102, 0.8171982 , 0.8052647 ]), 'mean_test_score': array([0.75931142, 0.78536108, 0.78802352, 0.7788294 , 0.78544696,
   0.78181546, 0.77935701, 0.77697323, 0.79011006, 0.77608708]), 'std_test_score': array([0.01159822, 0.0124273 , 0.0112318 , 0.01287854, 0.01920727,
   0.01968907, 0.01253142, 0.0153379 , 0.01563886, 0.01595216]), 'rank_test_score': array([10,  4,  2,  7,  3,  5,  6,  8,  1,  9], dtype=int32)}
0.6093407594278569

So it's still the same problem: the cross-validation score is about 0.8, but the score computed afterwards is about 0.6. I suppose different metrics are being used.

The workaround I found was to pass scoring=make_scorer(roc_auc_score) to RandomizedSearchCV. That gave consistent results: about 0.6 both during cross-validation and afterwards.
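Concretely, the only change was the scoring argument (a sketch; everything else is as in the code above):

from sklearn.metrics import make_scorer, roc_auc_score

random_search = RandomizedSearchCV(alg, param_distributions=params, n_iter=10,
                                   scoring=make_scorer(roc_auc_score),  # instead of scoring='roc_auc'
                                   n_jobs=4, cv=skf.split(X_train, y_train),
                                   verbose=3, random_state=1001)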

Can anyone explain what the problem is? I still don't understand it, and I still don't know how to solve it with Optuna optimization.


Solution

  • You're using model.predict, which returns hard class labels, but the ROC curve and roc_auc_score need the predicted probabilities (or another continuous confidence score); use model.predict_proba instead.

    See the scikit-learn documentation for roc_auc_score.
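    For a binary XGBClassifier, a minimal sketch (assuming the positive class is in column 1 of predict_proba):

    from sklearn.metrics import roc_auc_score

    # predict_proba returns one column per class; take the positive-class probability
    y_score = model.predict_proba(X_train)[:, 1]
    print(roc_auc_score(y_train, y_score))

    The cross-validated numbers were computed on continuous scores all along: xgb.cv's 'auc' uses the predicted scores, and scoring='roc_auc' in RandomizedSearchCV calls predict_proba internally. Feeding hard 0/1 labels into roc_auc_score is what produced the ~0.6, and make_scorer(roc_auc_score) reproduces that hard-label behavior, which is why it merely made both numbers agree at ~0.6 rather than fixing anything.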