python, machine-learning, xgboost, gradient-descent, supervised-learning

How to evaluate the stability of an xgboost classification model


I have:

  1. Python xgboost classification model
  2. Weekly datasets (the basis of classification) since the beginning of 2018. Each dataset has about 100 thousand rows and 70 columns (features).
  3. Weekly prediction results on these datasets from the xgboost model (using the logistic regression objective), in the format:
- date of modelling
- items
- test_auc_mean for each item (as a percentage).

In total there are about 100 datasets and 100 prediction_results since January 2018.

To assess the model I use metrics such as:

- AUC
- confusion matrix
- accuracy

import xgboost as xgb
from sklearn.metrics import confusion_matrix

param = {
    'num_parallel_tree':num_parallel_tree,
    'subsample':subsample,
    'colsample_bytree':colsample_bytree,
    'objective':objective, 
    'learning_rate':learning_rate, 
    'eval_metric':eval_metric, 
    'max_depth':max_depth,
    'scale_pos_weight':scale_pos_weight,
    'min_child_weight':min_child_weight,
    'nthread':nthread,
    'seed':seed
}

bst_cv = xgb.cv(
    param,
    dtrain,
    num_boost_round=n_estimators,
    nfold=nfold,
    early_stopping_rounds=early_stopping_rounds,
    verbose_eval=verbose,
    stratified=stratified
)

test_auc_mean = bst_cv['test-auc-mean']
# boosting round with the highest mean test AUC across the CV folds
best_iteration = test_auc_mean.idxmax()

bst = xgb.train(param, 
                dtrain, 
                num_boost_round = best_iteration)

best_train_auc_mean = bst_cv['train-auc-mean'][best_iteration]
best_train_auc_mean_std = bst_cv['train-auc-std'][best_iteration]

best_test_auc_mean = bst_cv['test-auc-mean'][best_iteration]
best_test_auc_mean_std = bst_cv['test-auc-std'][best_iteration]

print('''XGB CV model report
Best train-auc-mean {}% (std: {}%) 
Best test-auc-mean {}% (std: {}%)'''.format(round(best_train_auc_mean * 100, 2), 
                                          round(best_train_auc_mean_std * 100, 2), 
                                          round(best_test_auc_mean * 100, 2), 
                                          round(best_test_auc_mean_std * 100, 2)))

y_pred = bst.predict(dtest)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred > 0.9).ravel()


print('''
     | neg | pos |
__________________
true_| {}  | {}  |
false| {}  | {}  |
__________________

'''.format(tn, tp, fn, fp))

predict_accuracy_on_test_set = (tn + tp)/(tn + fp + fn + tp)
print('Test Accuracy: {}%'.format(round(predict_accuracy_on_test_set * 100, 2)))

The model gives me a general picture (usually the AUC is between .94 and .96). The problem is that the variability of the predictions for some specific items is very high (today an item is positive, tomorrow it is negative, the day after tomorrow it is positive again).

I want to evaluate the model's stability. In other words, I want to know how many items with variable results it generates. In the end, I want to be sure that the model generates stable results with minimal fluctuation. Do you have any thoughts on how to do this?


Solution

  • That's precisely the goal of cross-validation. Since you already did that, all you can do is evaluate the standard deviation of your evaluation metrics, which you already did as well...

    1. You can try some new metrics, like precision, recall, F1 score, or an Fβ score to weight successes and failures differently (a quick scikit-learn sketch is shown after this list), but it looks like you're almost out of solutions. You're dependent on your input data here :s

    2. You could spend some time on the training population distribution and try to identify which part of the population fluctuates over time (see the flip-rate sketch after this list).

    3. You could also predict probabilities rather than hard class labels, to evaluate whether the model is far above its threshold or not (see the margin sketch after this list).

    The last two solutions are more like side solutions. :(
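
    For point 1, a minimal sketch with scikit-learn, reusing y_test and the predicted probabilities y_pred from the question (the 0.9 threshold is kept from the original code; adjust it to your own cut-off):

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# hard 0/1 labels from the predicted probabilities, same 0.9 threshold as above
y_label = (y_pred > 0.9).astype(int)

print('Precision: {}%'.format(round(precision_score(y_test, y_label) * 100, 2)))
print('Recall: {}%'.format(round(recall_score(y_test, y_label) * 100, 2)))
print('F1 score: {}%'.format(round(f1_score(y_test, y_label) * 100, 2)))
# beta < 1 favours precision, beta > 1 favours recall
print('F0.5 score: {}%'.format(round(fbeta_score(y_test, y_label, beta=0.5) * 100, 2)))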
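
    For point 2, one way to quantify which items fluctuate is to count, for each item, how often its predicted label flips between consecutive weeks. A sketch assuming the weekly prediction results are collected into one pandas DataFrame weekly_preds with hypothetical columns 'date', 'item' and a 0/1 'pred' label (adapt the names to your actual format):

import pandas as pd

# one row per (date, item) with that week's predicted 0/1 label
weekly_preds = weekly_preds.sort_values(['item', 'date'])

# number of label changes between consecutive weeks, per item
flips = weekly_preds.groupby('item')['pred'].apply(lambda s: s.diff().abs().sum())

# share of week-to-week transitions that are flips:
# 0 = perfectly stable, close to 1 = flips almost every week
# (clip guards against items that appear in only one week)
n_weeks = weekly_preds.groupby('item')['pred'].size()
flip_rate = flips / (n_weeks - 1).clip(lower=1)

print('Items that flip in more than 30% of weeks:', (flip_rate > 0.3).sum())
print(flip_rate.sort_values(ascending=False).head(10))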
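
    For point 3, the probabilities returned by bst.predict(dtest) can be used directly to see how close each item sits to the decision threshold; items near the threshold are the ones most likely to flip from one week to the next. A sketch, again assuming the 0.9 threshold from the question:

import numpy as np

threshold = 0.9
proba = bst.predict(dtest)  # predicted probabilities, not hard labels

# distance from the decision threshold: a small margin means an easily flipped item
margin = np.abs(proba - threshold)
close_to_threshold = margin < 0.05

print('Items within 0.05 of the threshold: {} of {} ({}%)'.format(
    close_to_threshold.sum(), len(margin), round(close_to_threshold.mean() * 100, 2)))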