My data set is imbalanced.
The class ratio is about 45,000 : 1,500. When I use random oversampling, SMOTE, or SMOTETomek, I get scores above 97%.
However, when I actually ran the model on test data, all 1,500 minority cases were predicted as the opposite class.
With undersampling, the accuracy is about 85%, and about 300 of the 1,500 cases are predicted wrongly, so the gap in accuracy between the two approaches is large.
Of course I also checked recall and precision, but they were not significantly different from the accuracy. Could you explain why I get these results?
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y)
With that, I got these results:
accuracy: 0.9784
precision: 0.9718
recall: 0.9854
F1: 0.9786
AUC: 0.9784
But when I actually ran the model on a test set containing class 1 data, I got this result:
test_result = model.predict(test_x)
pd.DataFrame(test_result).value_counts()
0 1589
dtype: int64
Every class 1 case is predicted as 0.
The same happens with random oversampling, SMOTE, and SMOTETomek.
Undersampling
Accuracy : 86%
0 354
1 1235
dtype: int64
I don't know which approach is best. Is there anything else I can try?
Added: I found and referred to imbalanced-learn.org.
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_pred)
Should I check this score instead of plain accuracy when the data set is imbalanced?
When I check it for oversampling and undersampling, I get:
oversampling : 50%
undersampling : 84%
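To see why plain accuracy and balanced accuracy can disagree so sharply, here is a minimal pure-Python sketch. It assumes the usual binary definition of balanced accuracy as the mean of the per-class recalls; the labels and the always-predict-0 model are hypothetical stand-ins, not my real data.

```python
def recall_for_class(y_true, y_pred, label):
    """Fraction of the samples whose true class is `label` that were predicted as `label`."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total


def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    labels = sorted(set(y_true))
    return sum(recall_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical labels with roughly the 45000 : 1500 ratio from the question.
y_true = [0] * 45000 + [1] * 1500
# A degenerate model that always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy:          {acc:.4f}")      # 0.9677
print(f"balanced accuracy: {bal_acc:.4f}")  # 0.5000
```

A model that never predicts class 1 still looks about 97% accurate here, while its balanced accuracy drops to 50%, which is consistent with the 50% I see for oversampling.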
For anyone who doesn't know the answer:
Random oversampling is the problem. Because it repeatedly inserts randomly chosen copies of minority-class rows, the model can overfit to the duplicated data.
So when new test data was fed in, of course a different result came out.
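One way this can inflate the scores is sketched below, assuming the resampling was applied to the whole data set before the train/test split (the `fit_resample(X, y)` call above suggests that, but it is an assumption). The toy rows, the 80/20 split, and all variable names are hypothetical. SMOTE creates interpolated points rather than exact copies, so the effect there would be near-duplicates instead of identical rows.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: each row is (row_id, label); row_id stands in for the features.
majority = [(i, 0) for i in range(900)]
minority = [(i, 1) for i in range(900, 930)]

# Random oversampling: draw minority rows with replacement until the classes match.
oversampled_minority = [rng.choice(minority) for _ in range(len(majority))]

# Problematic order: resample first, then split. Copies of one row land on both sides.
data = majority + oversampled_minority
rng.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

train_minority_ids = {row_id for row_id, label in train if label == 1}
test_minority_ids = [row_id for row_id, label in test if label == 1]
leaked = sum(1 for row_id in test_minority_ids if row_id in train_minority_ids)
leak_fraction = leaked / len(test_minority_ids)
print(f"minority test rows that also appear in training: {leak_fraction:.0%}")

# Safer order: split the original rows first, then oversample only the training part.
original = majority + minority
rng.shuffle(original)
cut = int(0.8 * len(original))
train_raw, test_clean = original[:cut], original[cut:]
train_maj = [row for row in train_raw if row[1] == 0]
train_min = [row for row in train_raw if row[1] == 1]
train_balanced = train_maj + [rng.choice(train_min) for _ in range(len(train_maj))]

clean_train_ids = {row_id for row_id, _ in train_balanced}
overlap = sum(1 for row_id, _ in test_clean if row_id in clean_train_ids)
print(f"test rows that also appear in training (split first): {overlap}")
```

With the first ordering, the test score mostly measures how well the model memorized rows it has already seen. As far as I understand, putting the sampler inside a `Pipeline` from `imblearn.pipeline` keeps the resampling restricted to the training data during fitting.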