I am trying to train XGBoost for binary classification, with positive to negative instances at a 1:5 ratio. My problem is similar in spirit to cancer detection, i.e. false negatives (FNs) are much more costly than false positives (FPs). After quite a bit of reading, I am still confused about the following:
First, is it necessary for me to balance the classes, e.g. by over-sampling? I have a data size of around 160,000, with many entries containing NaN for certain columns. Regarding XGBoost in particular, I know it is common to adjust scale_pos_weight, but the documentation (https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html) notes that this is mainly for overall AUC performance. The main metric I care about is recall, but also accuracy to an extent.
Secondly, what metric should I try to maximise in the hyper-parameter tuning?
Thank you for your help.
FNs are much more costly than FPs
You can create your own evaluation metric based on an estimate of the costs of false negatives and false positives, and maximise that during training and tuning. The XGBoost documentation describes custom objectives and evaluation metrics; here is an example you can build on:
from typing import Tuple

import numpy as np
import xgboost as xgb
from sklearn.metrics import confusion_matrix

def your_objective(predt: np.ndarray, dtrain: xgb.DMatrix) -> Tuple[str, float]:
    # Cost-weighted gain, normalised by the maximum achievable gain
    # (every positive caught, no false positives). Higher is better.
    y = (dtrain.get_label() > 0.5).astype(int)
    pred_labels = (predt > 0.5).astype(int)  # binarise probabilities/margins
    tn, fp, fn, tp = confusion_matrix(y, pred_labels).ravel()
    your_gain = true_positive_cost * tp - false_positive_cost * fp
    max_gain = true_positive_cost * (fn + tp)
    return 'your_objective', your_gain / max_gain

results = {}
xgb.train(your_params,  # e.g. {'objective': 'binary:hinge', ...}
          dtrain=dtrain,
          num_boost_round=10,
          feval=your_objective,
          maximize=True,  # the metric is a gain, so larger is better
          evals=[(dtrain, 'dtrain'), (dtest, 'dtest')],
          evals_result=results)
You just need to define true_positive_cost and false_positive_cost.
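For instance (the cost values and the tiny search space below are placeholders for illustration), you could answer the second part of your question by selecting whichever hyper-parameter configuration maximises this metric on a held-out set:

# Placeholder costs: a missed positive (FN) is assumed far more expensive
# than a false alarm (FP) -- replace with your own domain estimates.
true_positive_cost = 10.0
false_positive_cost = 1.0

best_score, best_params = -float('inf'), None
for max_depth in (3, 5, 7):  # toy search space, purely illustrative
    params = {'objective': 'binary:hinge', 'max_depth': max_depth}
    results = {}
    xgb.train(params,
              dtrain=dtrain,
              num_boost_round=100,
              feval=your_objective,
              maximize=True,
              evals=[(dtest, 'dtest')],
              evals_result=results)
    score = results['dtest']['your_objective'][-1]  # metric at the last round
    if score > best_score:
        best_score, best_params = score, params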
And yes, do adjust for the class imbalance with scale_pos_weight, based on the ratio of the classes in dtrain (the usual recommendation is the number of negative instances divided by the number of positive instances).
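A minimal sketch of that adjustment (assuming your_params is the parameter dict you pass to xgb.train); with your 1:5 positive-to-negative ratio it comes out to roughly 5:

import numpy as np

# Common heuristic from the XGBoost docs: scale_pos_weight ~= negatives / positives
y_train = dtrain.get_label()
your_params['scale_pos_weight'] = float(np.sum(y_train == 0)) / float(np.sum(y_train == 1))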