Tags: python, machine-learning, scikit-learn, sampling, imbalanced-data

Which should I use, oversampling or undersampling?


The data I have is imbalanced, at a ratio of about 45000 : 1500. When I apply oversampling with SMOTE or SMOTETomek, the evaluation metrics come out above 97%.

However, when I actually ran the test, all 1,500 minority-class cases were predicted as the opposite class.

When undersampling is used instead, the accuracy is about 85%; of the 1,500 cases, about 300 are misclassified. So the gap between the two approaches is large.

I also checked recall and precision, but they did not differ much from the accuracy. Could you explain why these results occur?

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y)
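One thing worth noting about the snippet above: it resamples the whole dataset before any train/test split, so evaluation can end up scoring synthetic samples. The usual pattern is to split first and resample only the training portion. A minimal sketch of that pattern, using plain random oversampling with NumPy as a dependency-free stand-in (imblearn's SMOTE would slot into the same place); the sizes here are toy values, not the question's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)          # 19:1 toy imbalance

# Split FIRST, stratified so both classes appear in the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority class in the TRAINING data only
minority = np.where(y_tr == 1)[0]
majority = np.where(y_tr == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_tr[keep], y_tr[keep]

print(np.bincount(y_bal))   # training classes are now balanced
print(np.bincount(y_te))    # test set keeps the original imbalance
```

This way the test set stays untouched and reflects the real class distribution.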

I got this confusion matrix (image omitted) and these metrics:

accuracy: 0.9784
precision: 0.9718
recall: 0.9854
F1: 0.9786
AUC: 0.9784

But when I actually ran the model on a test set containing class 1 data, I got this result:

test_result = model.predict(test_x)
pd.DataFrame(test_result).value_counts()

0 1589
dtype: int64

Every class 1 sample is predicted as 0.
The result is the same whether I use random oversampling, SMOTE, or SMOTETomek.
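The numbers above are a classic case of the accuracy paradox: on a 45000 : 1500 split like the one in the question, a model that predicts 0 for everything still scores almost 97% accuracy while finding zero minority samples. A quick check of that arithmetic:

```python
import numpy as np

# 45000 majority (0) and 1500 minority (1) samples, as in the question
y_true = np.array([0] * 45000 + [1] * 1500)
y_pred = np.zeros_like(y_true)              # degenerate "always 0" model

accuracy = (y_true == y_pred).mean()
minority_recall = y_pred[y_true == 1].mean()  # fraction of 1s found

print(round(accuracy, 4))   # ≈ 0.9677 despite the model being useless
print(minority_recall)      # 0.0 — no minority sample is caught
```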

Undersampling

Accuracy : 86%

0 354
1 1235
dtype: int64

I don't know which approach is best. Is there anything else I can try?

Added: I found and referenced imbalanced-learn.org:

from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_pred)  

Should I use this balanced accuracy score, since I have an imbalanced data set?
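Balanced accuracy averages the per-class recalls, so the degenerate "always predict the majority" model drops to 0.5 even though plain accuracy looks excellent. A small sketch (same toy all-zero predictor as above, using the scikit-learn functions from the snippet):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 45000 + [1] * 1500)
y_pred = np.zeros_like(y_true)              # predicts majority class only

plain = accuracy_score(y_true, y_pred)              # ≈ 0.9677
balanced = balanced_accuracy_score(y_true, y_pred)  # (1.0 + 0.0) / 2 = 0.5

print(plain, balanced)
```

That is consistent with the 50% balanced score seen for the oversampled model below.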

When I check the balanced accuracy for oversampling and undersampling:

oversampling : 50%
undersampling : 84%


Solution

  • For anyone who doesn't know the answer:

    Random oversampling is the problem. Because it repeatedly inserts duplicate copies of minority-class samples, the model can overfit to those repeated points.

    So when unseen test data was fed in, it naturally produced very different results.
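To make the overfitting mechanism concrete, here is a toy sketch of what random oversampling actually does: it draws the same minority rows again and again with replacement, so a flexible model can memorize those exact rows instead of learning the class boundary. The sizes are a small analogue of the question's imbalance, not real data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_minority = 15                      # 15 distinct minority samples (toy scale)
n_majority = 450                     # 30:1 imbalance

# Random oversampling = drawing minority row indices with replacement
idx = rng.choice(n_minority, size=n_majority, replace=True)
counts = np.bincount(idx, minlength=n_minority)

print(counts)        # each minority row is repeated ~30 times on average
print(counts.max())  # some rows repeat even more often — easy to memorize
```

SMOTE avoids exact duplicates by interpolating new points, but when the synthetic points crowd around a handful of originals the model can still fit the resampled training set far better than it fits real, unseen minority data.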