My dataset is evenly split between classes 0 and 1: 100,000 data points in total, 50,000 labeled 0 and 50,000 labeled 1. I did an 80/20 train/test split and got a 98% accuracy score. However, the confusion matrix shows an awful lot of false positives. I'm new to xgboost and decision trees in general. What settings can I change in the XGBClassifier to reduce the number of false positives, or is that even possible? Thank you.
import xgboost as xgb
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import train_test_split

# 80% training and 20% test, stratified so both splits stay balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=0, stratify=y)

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                          colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
                          importance_type='gain', interaction_constraints='',
                          learning_rate=0.1, max_delta_step=0, max_depth=9,
                          min_child_weight=1, missing=None, monotone_constraints='()',
                          n_estimators=180, n_jobs=4, num_parallel_tree=1, random_state=0,
                          reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
                          tree_method='exact', use_label_encoder=False,
                          validate_parameters=1, verbosity=None)

# Stop adding trees once aucpr on the eval set has not improved for 10 rounds
model.fit(X_train,
          y_train,
          verbose=True,
          early_stopping_rounds=10,
          eval_metric="aucpr",
          eval_set=[(X_test, y_test)])

# Class 0 is displayed as 'Old Forests', class 1 as 'Not Old Forests'
plot_confusion_matrix(model,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=['Old Forests', 'Not Old Forests'])
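Yes, it is possible. If you are looking for a simple fix, lower scale_pos_weight below its default of 1. This down-weights the positive class during training, so the model predicts class 1 less readily. That lowers the false positive rate even though your dataset is balanced, usually at the cost of some extra false negatives.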
For a more robust fix, run a hyperparameter search. In particular, try different values of scale_pos_weight, reg_alpha (alpha), reg_lambda (lambda), gamma and min_child_weight, since these have the most impact on how conservative the model is going to be.