python machine-learning classification metrics

Metrics to consider for a heavily imbalanced dataset

I am trying to train a GradientBoosting model on a heavily imbalanced dataset in Python. The class distribution is roughly 0.96 : 0.04 for class 0 and class 1 respectively.

After some parameter tuning guided by the recall and precision scores, I arrived at a good model. Its scores on the validation set are given below; they are also close to the cross-validation scores.

recall      : 0.928777
precision   : 0.974747
auc         : 0.9636
kappa       : 0.948455
f1 weighted : 0.994728

If I want to tune the model further, which metrics should I try to increase? In my problem, misclassifying 1 as 0 is more problematic than mispredicting 0 as 1.

Solution

There are various techniques for dealing with class imbalance. A few are listed below:

(Links include Python's imblearn package and costcla package)

Resample:
- Undersample the majority class (class 0 in your case). You can try random undersampling for starters.
- Oversample the minority class (class 1). Explore the SMOTE/ADASYN techniques.

Ensemble techniques:
- Bagging/boosting techniques.

Cost-sensitive learning: You should definitely explore this, since you mentioned:

"In my problem misclassifying 1 as 0 is more problematic than mispredicting 0 as 1."
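Because false negatives are the costlier error here, one metric worth tracking alongside recall is F-beta with beta > 1, which weights recall more heavily than precision. Below is a minimal sketch using the precision and recall values reported in the question (assumed to refer to class 1). The helper name `fbeta` is only illustrative; scikit-learn's `sklearn.metrics.fbeta_score` computes the same quantity directly from labels and predictions.

```python
def fbeta(precision, recall, beta=1.0):
    """F-beta score from precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + b2) * precision * recall / denom

# Validation-set values from the question (assumed to be for class 1).
precision, recall = 0.974747, 0.928777

f1 = fbeta(precision, recall, beta=1.0)
f2 = fbeta(precision, recall, beta=2.0)  # penalizes missed positives more
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

Since precision is higher than recall for this model, F2 comes out below F1, which makes the remaining false negatives more visible while tuning.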

For cost-sensitive learning with the costcla package, try the following approach, keeping GradientBoostingClassifier as your base classifier:

costcla.sampling.cost_sampling(X, y, cost_mat, method='RejectionSampling', oversampling_norm=0.1, max_wc=97.5)

Here cost_mat holds one row [C_FP, C_FN, C_TP, C_TN] for each data point in the train and test sets. Set C_FP and C_FN according to the misclassification costs you want to assign to false positives and false negatives. Refer to the full tutorial on credit score data here.
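To make the cost matrix concrete, here is a rough sketch in plain Python of how a per-example cost matrix with columns [C_FP, C_FN, C_TP, C_TN] turns predictions into a total cost. The cost values (a false negative ten times as expensive as a false positive), the toy labels, and the function name `total_cost` are all hypothetical and only for illustration; this is not costcla's own implementation.

```python
def total_cost(y_true, y_pred, cost_mat):
    """Sum per-example costs; each cost_mat row is [C_FP, C_FN, C_TP, C_TN]."""
    total = 0.0
    for yt, yp, (c_fp, c_fn, c_tp, c_tn) in zip(y_true, y_pred, cost_mat):
        if yt == 1:
            total += c_tp if yp == 1 else c_fn  # true positive or false negative
        else:
            total += c_fp if yp == 1 else c_tn  # false positive or true negative
    return total

# Hypothetical costs: missing a class-1 example is 10x worse than a false alarm.
C_FP, C_FN = 1.0, 10.0
y_true = [0, 0, 0, 1, 1]
cost_mat = [[C_FP, C_FN, 0.0, 0.0] for _ in y_true]

pred_a = [0, 0, 0, 1, 0]  # one false negative
pred_b = [1, 1, 0, 1, 1]  # two false positives

print(total_cost(y_true, pred_a, cost_mat))  # 10.0
print(total_cost(y_true, pred_b, cost_mat))  # 2.0
```

Under these assumed costs, the second set of predictions is preferable even though it makes more errors, which is the kind of trade-off a cost-sensitive approach is meant to capture.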