I am trying to train a GradientBoosting model on a heavily imbalanced dataset in Python. The class distribution is roughly 0.96 : 0.04 for class 0 and class 1 respectively.
After some parameter tuning guided by the recall and precision scores, I came up with a good model. The metric scores on the validation set are given below; they are close to the cross-validation scores.
recall : 0.928777
precision : 0.974747
auc : 0.9636
kappa : 0.948455
f1 weighted : 0.994728
If I want to tune the model further, which metrics should I try to increase? In my problem, misclassifying 1 as 0 is more problematic than mispredicting 0 as 1.
There are various techniques for dealing with class imbalance. A few are listed below.
(The links include Python's imblearn and costcla packages.)
Resample:
Ensemble Techniques:
Cost-sensitive Learning: You should definitely explore this since you have mentioned:
In my problem, misclassifying 1 as 0 is more problematic than mispredicting 0 as 1.
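For cost-sensitive learning with the costcla package, you could try the following approach, keeping GradientBoostingClassifier as your base classifier (the task here is classification, so the classifier rather than GradientBoostingRegressor is the appropriate estimator):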
costcla.sampling.cost_sampling(X, y, cost_mat, method='RejectionSampling', oversampling_norm=0.1, max_wc=97.5)
Here, cost_mat holds one row [C_FP, C_FN, C_TP, C_TN] for each data point in train and test. C_FP and C_FN are the misclassification costs you want to assign to false positives and false negatives; since missing a class 1 example is the worse error in your case, C_FN should be larger than C_FP. Refer to the full tutorial on credit score data here.
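To make the idea behind cost-proportionate rejection sampling concrete, here is a minimal pure-Python sketch. It is an illustration only, not costcla's actual implementation: the helper names make_cost_mat and rejection_sample are hypothetical, and the costs C_FP = 1 and C_FN = 10 are made-up values that you would replace with costs from your own problem. Each example is kept with probability equal to its misclassification cost divided by the largest such cost, so the expensive-to-miss class 1 examples always survive while most class 0 examples are dropped.

```python
import random


def make_cost_mat(y, c_fp=1.0, c_fn=10.0):
    """Build one row [C_FP, C_FN, C_TP, C_TN] per example.

    The default costs are hypothetical; correct predictions cost nothing here.
    """
    return [[c_fp, c_fn, 0.0, 0.0] for _ in y]


def rejection_sample(X, y, cost_mat, seed=0):
    """Keep example i with probability mis_cost[i] / max(mis_cost)."""
    rng = random.Random(seed)
    # Cost of getting example i wrong: C_FN for a positive, C_FP for a negative.
    mis_cost = [row[1] if label == 1 else row[0]
                for row, label in zip(cost_mat, y)]
    max_cost = max(mis_cost)
    keep = [i for i, c in enumerate(mis_cost)
            if rng.random() <= c / max_cost]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data mirroring the 0.96 : 0.04 split from the question.
y = [0] * 96 + [1] * 4
X = [[float(i)] for i in range(len(y))]

cost_mat = make_cost_mat(y)
X_s, y_s = rejection_sample(X, y, cost_mat)

print("kept examples:", len(y_s))
print("kept positives:", sum(y_s))
```

With these costs every positive is kept and each negative is kept with probability 0.1, so the resampled set is far less skewed and a standard classifier trained on it is pushed toward fewer false negatives. costcla expects cost_mat as an array-like of shape (n_samples, 4), so the list of rows built above would need to be converted (for example with numpy.asarray) before being passed to the library.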