python, scikit-learn, classification, data-science, imbalanced-data

Cost Sensitive Classifier fails for heavily imbalanced datasets


I will try to keep this as specific as possible, but it is also something of a general question. I have a heavily skewed dataset with class proportions on the order of { 'Class 0': 0.987, 'Class 1': 0.012 }. I would like a set of classifiers that work well on such datasets, and then to build an ensemble learner from those models. I do not think I want to oversample or undersample, and I definitely don't want to use SMOTE, because these approaches either don't scale well to high-dimensional data or result in a very large number of data points.

I want to take a cost-sensitive approach to building my classifiers, which is how I came across the class_weight='balanced' parameter in scikit-learn. However, it doesn't seem to help much: my F1 scores are still terrible (around 0.02). I have also tried using sklearn.utils.class_weight.compute_class_weight to calculate the weights manually, storing them in a dictionary and passing that as the class_weight parameter. I see no improvement in F1 score; my false positives are still very high (around 5k) and everything else is quite low (less than 50).

I don't understand what I am missing. Am I implementing something wrong? What else can I do to tackle my problem? When I change my evaluation metric from f1_score(average='binary') to f1_score(average='weighted'), the F1 score increases from ~0.02 to ~98.66, which I think is probably wrong. Any help, including references on how to tackle this problem, would be much appreciated.
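To illustrate why the two averaging modes can disagree so sharply, here is a minimal sketch in plain Python that reproduces the arithmetic behind average='binary' and average='weighted'. The confusion-matrix counts are hypothetical (a 10,000-sample test set with about 1.2% positives) and are not taken from the question.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration:
# 9,880 true negatives-class samples and 120 positive-class samples.
tn, fp, fn, tp = 9850, 30, 100, 20


def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 for a single class from its TP, FP and FN counts."""
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


# Class 1 as the positive class: this is what average='binary' reports.
f1_pos = f1_from_counts(tp, fp, fn)

# Class 0 as the "positive" class: TN plays the role of TP, and FP/FN swap.
f1_neg = f1_from_counts(tn, fn, fp)

# average='weighted' weights each per-class F1 by that class's support,
# so the majority class dominates the result.
support_neg = tn + fp
support_pos = tp + fn
f1_weighted = (support_neg * f1_neg + support_pos * f1_pos) / (
    support_neg + support_pos
)

# average='macro' gives both classes equal say.
f1_macro = (f1_neg + f1_pos) / 2

print(f"binary (class 1): {f1_pos:.3f}")
print(f"class 0 only:     {f1_neg:.3f}")
print(f"weighted:         {f1_weighted:.3f}")
print(f"macro:            {f1_macro:.3f}")
```

With roughly 99% of the support in class 0, the weighted score mostly reflects how well the majority class is predicted. A very high weighted F1 next to a very low binary F1 is therefore not necessarily a bug; the binary score is the one that describes the minority class.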

I am trying to use XGBoost, CatBoost, LightGBM, logistic regression, SVC(kernel='linear') and random forest classifiers.
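For the scikit-learn estimators in that list that accept class_weight (LogisticRegression, SVC, RandomForestClassifier), the 'balanced' heuristic is n_samples / (n_classes * count_of_class). The sketch below works through that formula by hand on a hypothetical label vector with the same skew as the question; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical labels with about 1.2% positives (not the asker's data).
y = [0] * 9880 + [1] * 120

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# The 'balanced' heuristic: n_samples / (n_classes * count_of_class).
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

# Ratio of negatives to positives. XGBoost's documentation suggests this
# quantity as a starting value for its scale_pos_weight parameter.
neg_to_pos_ratio = counts[0] / counts[1]

print(class_weight)
print(f"negatives per positive: {neg_to_pos_ratio:.2f}")
```

A dictionary built from compute_class_weight('balanced', ...) contains these same numbers, so passing it manually is expected to behave exactly like class_weight='balanced'. Seeing no change between the two is consistent with that, rather than a sign of an implementation mistake.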


Solution

  • I realized that this question arose from pure naivete. I resolved my problem by using the imbalanced-learn Python library. Algorithms like imblearn.ensemble.EasyEnsembleClassifier are a godsend for heavily imbalanced classification where the minority class matters more than the majority class. For anyone having similar trouble, I suggest looking beyond your usual favorite algorithms for one that fits your problem.