Search code examples
python-3.xscikit-learnxgboost

How to reduce false positives in xgboost?


My dataset is evenly split between 0 and 1 classifiers. 100,000 data points total with 50,000 being classified as 0 and another 50,000 classified as 1. I did an 80/20 split to train/test the data and returned a 98% accuracy score. However, when looking at the confusion matrix I have an awful lot of false positives. I'm new to xgboost and decision trees in general. What settings can I change in the XGBClassifier to reduce the number of false positives or is it even possible? Thank you.

enter image description here

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0, stratify=y) # 80% training and 20% test

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=9,
              min_child_weight=1, missing=None, monotone_constraints='()',
              n_estimators=180, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

model.fit(X_train,
           y_train,
           verbose = True, 
           early_stopping_rounds=10,
           eval_metric = "aucpr",
           eval_set = [(X_test, y_test)])

plot_confusion_matrix(model,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=['Old Forests', 'Not Old Forests'])

Solution

  • Yes If you are looking for a simple fix, you lower the value of scale_pos_weight. This will lower false positive rate even though your dataset is balanced.

    For a more robust fix, you will need to run hyperparamter tuning search. Especially you should try different values of : scale_pos_weight, alpha, lambda, gamma and min_child_weight. Since they are the ones with the most impact on how conservative the model is going to be.