python, machine-learning, classification, metrics

Metrics to consider for a heavily imbalanced dataset


I am trying to train a gradient boosting model on heavily imbalanced data in Python. The class distribution is roughly 0.96 : 0.04 for class 0 and class 1, respectively.

After some parameter tuning guided by the recall and precision scores, I arrived at a good model. Its scores on the validation set are given below; they are close to the cross-validation scores.

recall : 0.928777
precision : 0.974747
auc : 0.9636
kappa : 0.948455
f1 weighted : 0.994728

If I want to tune the model further, which metrics should I try to improve? In my problem, misclassifying 1 as 0 is more problematic than misclassifying 0 as 1.
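To make the listed metrics concrete, here is a minimal sketch that recomputes recall, precision, weighted F1 and Cohen's kappa from raw confusion counts. The data is made up (it is not the dataset from the question), and the helper names `confusion_counts`, `safe_div`, `f1_from` and `summarize` are hypothetical. It is meant only to illustrate that, with a 96 : 4 class split, weighted F1 is dominated by class 0 and can stay very high while recall on class 1 is much lower.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


def summarize(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn

    # Per-class precision and recall (class 1 first, then class 0).
    recall_1, precision_1 = safe_div(tp, tp + fn), safe_div(tp, tp + fp)
    recall_0, precision_0 = safe_div(tn, tn + fp), safe_div(tn, tn + fn)

    # Weighted F1: per-class F1 averaged with class support as weights.
    f1_1 = f1_from(precision_1, recall_1)
    f1_0 = f1_from(precision_0, recall_0)
    f1_weighted = ((tn + fp) * f1_0 + (tp + fn) * f1_1) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = safe_div(p_observed - p_chance, 1 - p_chance)

    return {
        "recall": recall_1,
        "precision": precision_1,
        "f1_weighted": f1_weighted,
        "kappa": kappa,
    }


# Made-up example: 960 negatives and 40 positives,
# with 2 false alarms and 10 missed positives.
y_true = [0] * 960 + [1] * 40
y_pred = [0] * 958 + [1] * 2 + [0] * 10 + [1] * 30

scores = summarize(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy case a quarter of the positives are missed, yet weighted F1 remains above 0.98. When misclassifying 1 as 0 is the costlier error, recall on class 1 is the number that reflects it most directly. scikit-learn provides `recall_score`, `precision_score`, `f1_score` and `cohen_kappa_score` for the same calculations on real data.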


Solution

  • There are various techniques for dealing with class imbalance. A few are listed below:

    (The links include Python's imblearn and costcla packages.)

    1. Resample:

    2. Ensemble Techniques:

    3. Cost-sensitive Learning: You should definitely explore this, since you mentioned:

    In my problem, misclassifying 1 as 0 is more problematic than misclassifying 0 as 1.

    For cost-sensitive learning with the costcla package, try the following approach, keeping GradientBoostingClassifier as your base classifier:

    from costcla.sampling import cost_sampling
    X_cps, y_cps, cost_mat_cps = cost_sampling(X, y, cost_mat, method='RejectionSampling', oversampling_norm=0.1, max_wc=97.5)
    

    Here you supply a cost_mat with the columns [C_FP, C_FN, C_TP, C_TN] for each data point in the training and test sets. C_FP and C_FN are the misclassification costs you want to assign to false positives and false negatives, respectively. Refer to the full tutorial on credit score data for details.
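As a rough illustration of the cost matrix described above, the sketch below builds one row of [C_FP, C_FN, C_TP, C_TN] per example and totals the cost of a set of predictions. The cost values (a false negative costing ten times a false positive) and the helper names `build_cost_mat` and `total_cost` are assumptions made for this example, not part of costcla. Plain Python lists are used to keep it self-contained; costcla itself expects a NumPy array of shape (n_samples, 4).

```python
# Hypothetical costs: missing a positive (FN) is ten times worse than a false alarm (FP).
C_FP, C_FN, C_TP, C_TN = 1.0, 10.0, 0.0, 0.0


def build_cost_mat(n_samples, c_fp=C_FP, c_fn=C_FN, c_tp=C_TP, c_tn=C_TN):
    """One row [C_FP, C_FN, C_TP, C_TN] per example (the same costs for every row here)."""
    return [[c_fp, c_fn, c_tp, c_tn] for _ in range(n_samples)]


def total_cost(y_true, y_pred, cost_mat):
    """Sum the example-dependent cost incurred by each prediction."""
    total = 0.0
    for t, p, (c_fp, c_fn, c_tp, c_tn) in zip(y_true, y_pred, cost_mat):
        if t == 1 and p == 1:
            total += c_tp      # true positive
        elif t == 0 and p == 1:
            total += c_fp      # false positive
        elif t == 1 and p == 0:
            total += c_fn      # false negative
        else:
            total += c_tn      # true negative
    return total


y_true = [0, 0, 0, 1, 1]
cost_mat = build_cost_mat(len(y_true))

miss_one_positive = [0, 0, 0, 0, 1]   # one false negative
one_false_alarm = [1, 0, 0, 1, 1]     # one false positive

print("cost with one missed positive:", total_cost(y_true, miss_one_positive, cost_mat))
print("cost with one false alarm:", total_cost(y_true, one_false_alarm, cost_mat))
```

Both prediction sets contain exactly one error, but under these assumed costs the missed positive is penalised ten times more. That asymmetry is what the cost matrix encodes. To pass such a matrix to costcla, convert it first, for example with `numpy.asarray(cost_mat)`.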