For a binary text classification problem with imbalanced data, I use the RandomOverSampler class from the imbalanced-learn library to balance the classes.
Now I want to retrieve only the instances that were oversampled (replicated) from the original data. For example, if "item_1" is an original instance and "item_2" through "item_4" are replicas of "item_1", I need only the indices of "item_2", "item_3", and "item_4" for further processing, not the index of "item_1".
Here is my code:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
# Wrap each training instance in a list so the input is 2-D (n_samples, 1)
X_listed = []
for eachTrainInstance in X_train:
    X_listed.append([eachTrainInstance])
# fit_sample was renamed to fit_resample in later imbalanced-learn releases
X_tr_resampled, y_tr_resampled = ros.fit_resample(X_listed, y_train)
It seems that all the oversampled instances (and, of course, their corresponding indices) are appended after the original data that was passed to the oversampler, so I slice them off the end:
oversampled_instances = y_tr_resampled[len(y_train):]
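
To illustrate the "replicas are appended at the end" assumption, here is a minimal pure-Python sketch of random oversampling. It is not imbalanced-learn's implementation: `random_oversample` is a hypothetical helper and the toy `X_train`/`y_train` values are made up. It only shows how the positions of the replicas, and the original rows they were copied from, can be recovered when originals come first.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Hypothetical helper: duplicate minority rows until classes are balanced.

    Returns the resampled X and y plus, for every resampled row, the index of
    the original row it came from. Originals come first, replicas are appended.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    source_idx = list(range(len(y)))  # originals keep their positions
    for label, count in counts.items():
        pool = [i for i, lab in enumerate(y) if lab == label]
        # Draw (target - count) replicas with replacement from this class
        source_idx.extend(rng.choice(pool) for _ in range(target - count))
    X_res = [X[i] for i in source_idx]
    y_res = [y[i] for i in source_idx]
    return X_res, y_res, source_idx


# Toy data (made up): four majority rows, one minority row
X_train = ["item_a", "item_b", "item_c", "item_d", "item_e"]
y_train = [0, 0, 0, 0, 1]

X_res, y_res, source_idx = random_oversample(X_train, y_train)

n_original = len(y_train)
# Positions of the replicas inside the resampled data
replica_positions = list(range(n_original, len(y_res)))
# Index of the original row each replica was copied from
replica_sources = source_idx[n_original:]
```

With the real library, whether this ordering holds is worth checking against your installed version. Recent imbalanced-learn releases expose a `sample_indices_` attribute on the fitted `RandomOverSampler`, which, if available, could be sliced the same way (`ros.sample_indices_[len(y_train):]`) to get the source index of each replica.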