python machine-learning classification random-forest supervised-learning

Imbalanced data: undersampling or oversampling?


I have a binary classification problem where one class represents 99.1% of all observations (210,000). To deal with the imbalanced data, I chose sampling techniques, but I don't know which to use: undersampling the majority class or oversampling the minority class. Does anybody have any advice?

Thank you.

P.S. I use the random forest algorithm from sklearn.


Solution

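    • oversampling the minority class, or
    • undersampling the majority class, or
    • oversampling the minority and undersampling the majority

    The choice among these is a hyperparameter. Use cross-validation to see which one works best, but keep separate training, validation, and test sets so that the final score comes from data that was not used to pick the strategy.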
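
    A minimal sketch of that comparison, under several assumptions that are not in the original post: the data is synthetic (make_classification), the helper names resample and cv_score are hypothetical, average precision is used as the score, and a "none" baseline is added next to the three options. Resampling is applied only to the training part of each fold, so the validation folds and the test set keep the original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, train_test_split


def resample(X, y, strategy, rng):
    """Hypothetical helper: random resampling for binary labels 0/1."""
    if strategy == "none":
        return X, y
    minority = int(np.argmin(np.bincount(y, minlength=2)))
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    if strategy == "over":
        # Draw minority rows with replacement up to the majority size.
        min_idx = rng.choice(min_idx, size=len(maj_idx), replace=True)
    elif strategy == "under":
        # Keep a random subset of majority rows of the minority size.
        maj_idx = rng.choice(maj_idx, size=len(min_idx), replace=False)
    elif strategy == "both":
        # Meet in the middle: grow the minority, shrink the majority.
        target = (len(min_idx) + len(maj_idx)) // 2
        min_idx = rng.choice(min_idx, size=target, replace=True)
        maj_idx = rng.choice(maj_idx, size=target, replace=False)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    idx = rng.permutation(np.concatenate([min_idx, maj_idx]))
    return X[idx], y[idx]


def cv_score(X, y, strategy, n_splits=5, seed=0):
    """Mean average precision over stratified folds for one strategy."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        # Resample the training fold only; validate on untouched data.
        X_res, y_res = resample(X[train_idx], y[train_idx], strategy, rng)
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_res, y_res)
        proba = model.predict_proba(X[val_idx])[:, 1]
        scores.append(average_precision_score(y[val_idx], proba))
    return float(np.mean(scores))


# Synthetic stand-in for the real data (assumption): about 5% positives.
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Hold out a test set that is never used for choosing the strategy.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

strategies = ["none", "over", "under", "both"]
results = {s: cv_score(X_dev, y_dev, s) for s in strategies}
best = max(results, key=results.get)

# Refit with the chosen strategy and score once on the held-out test set.
X_res, y_res = resample(X_dev, y_dev, best, np.random.default_rng(0))
final = RandomForestClassifier(n_estimators=50, random_state=0)
final.fit(X_res, y_res)
test_ap = average_precision_score(y_test, final.predict_proba(X_test)[:, 1])

print(results)
print("best strategy:", best, "test average precision:", round(test_ap, 3))
```

    Which strategy wins depends on the data, so the printed ranking from this synthetic example should not be read as general advice. The imbalanced-learn package offers ready-made samplers; plain NumPy is used here only to keep the sketch self-contained.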