Tags: scikit-learn, classification, multilabel-classification, lightgbm

sklearn HistGradientBoostingClassifier with large unbalanced data


I've been using scikit-learn's HistGradientBoostingClassifier to classify some data. My experiment is multi-class classification with single-label predictions (20 classes).

My experiments cover two cases. In the first case, I measure accuracy without data augmentation (around 3,000 unbalanced samples). In the second case, I measure accuracy with data augmentation (around 12,000 unbalanced samples). I am using default parameters in both cases.

In the first case, HistGradientBoostingClassifier shows an accuracy of around 86.0%. However, with data augmentation, accuracy drops to a weak 23% or so.

I suspect this drop comes from the unbalanced dataset, but since scikit-learn's HistGradientBoostingClassifier offers no feature for handling unbalanced datasets, I cannot verify that.
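
The closest I have come to probing this is looking at per-class metrics instead of plain accuracy; a minimal sketch, where clf stands for my fitted HistGradientBoostingClassifier and X_test/y_test for my held-out split:

```python
# Per-class metrics show whether the minority classes drive the low score;
# clf, X_test, y_test are placeholders for the fitted model and held-out split.
from sklearn.metrics import balanced_accuracy_score, classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
```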

Has anyone run into the same kind of problem with large datasets and HistGradientBoostingClassifier?

Edit: I tried other algorithms on the same data split, and the results seem normal (accuracy around 5% higher with data augmentation). I am wondering why I only get this behavior with HistGradientBoostingClassifier.


Solution

  • I have found something that could be the reason for the lack of accuracy when using the HistGradientBoostingClassifier algorithm with default parameters on the augmented dataset of roughly 12,000 samples.

    I compared HistGradientBoostingClassifier and LightGBM on the same data split (scikit-learn's HistGradientBoostingClassifier is inspired by Microsoft's LightGBM). HistGradientBoostingClassifier shows a weak accuracy of 24.7%, while LightGBM shows a strong 87.5%.
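
    A runnable sketch of that comparison on synthetic stand-in data, since the original dataset is not available (the make_classification settings below are assumptions, not the real data):

    ```python
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic unbalanced 20-class data standing in for the real ~12,000 samples.
    X, y = make_classification(n_samples=12000, n_classes=20, n_informative=15,
                               weights=[0.3] + [0.7 / 19] * 19, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                        random_state=0)

    # Both models with default parameters on the same split.
    # (On scikit-learn < 1.0, this import is required first:
    #  from sklearn.experimental import enable_hist_gradient_boosting)
    for name, clf in [("HistGradientBoostingClassifier", HistGradientBoostingClassifier()),
                      ("LGBMClassifier", LGBMClassifier())]:
        clf.fit(X_train, y_train)
        print(name, accuracy_score(y_test, clf.predict(X_test)))
    ```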

    From what I can read in scikit-learn's and Microsoft's docs, HistGradientBoostingClassifier "cannot handle properly" an unbalanced dataset, while LightGBM can. The latter has a class_weight parameter (dict, 'balanced' or None, optional (default=None)), found on the LGBMClassifier documentation page.
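
    For example, reusing the split from the sketch above:

    ```python
    from lightgbm import LGBMClassifier

    # 'balanced' weights each class inversely to its frequency in y_train,
    # so minority classes count as much as majority ones during training.
    clf = LGBMClassifier(class_weight='balanced')
    clf.fit(X_train, y_train)
    ```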

    My hypothesis is that the dataset becomes even more unbalanced with augmentation and, without any feature in the HistGradientBoostingClassifier algorithm for handling unbalanced data, the algorithm is misled. Also, as mentioned by Hanafi Haffidz in the comments, the algorithm could tend to overfit with default parameters. A possible workaround is sketched below.
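
    As a possible workaround (my own addition, not something from the docs quoted above): since scikit-learn 0.23, HistGradientBoostingClassifier.fit accepts sample_weight, so balanced class weights can be emulated per sample:

    ```python
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.utils.class_weight import compute_sample_weight

    # Weight each training sample inversely to its class frequency.
    weights = compute_sample_weight(class_weight="balanced", y=y_train)

    # Smaller trees and some L2 penalty may also curb the overfitting
    # mentioned above; these values are illustrative, not tuned.
    hgb = HistGradientBoostingClassifier(max_leaf_nodes=15, l2_regularization=1.0)
    hgb.fit(X_train, y_train, sample_weight=weights)
    ```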