Tags: nlp, text-classification, indices, oversampling, imbalanced-data

Retrieve the indices for only the resampled instances after oversampling using imbalanced-learn?


For a binary text classification problem with imbalanced data, I use the RandomOverSampler class from the imbalanced-learn library to balance the classes.

Now I want to retrieve only the instances that were oversampled (replicated) from the original data. For example, if "item_1" is the original instance and "item_2" to "item_4" are its replicas, I need only the indices of "item_2", "item_3" and "item_4" for further processing, and not the index of "item_1".

  1. item_1
  2. item_2
  3. item_3
  4. item_4

Here is my code:

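from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# The sampler expects 2-D input, so wrap each text in its own one-element row
X_listed = [[eachTrainInstance] for eachTrainInstance in X_train]

# fit_sample is the older name; current imbalanced-learn releases use fit_resample
X_tr_resampled, y_tr_resampled = ros.fit_resample(X_listed, y_train)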

Solution

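  • RandomOverSampler keeps the original instances in their original order and appends the replicated instances after them. Everything from position len(y_train) onward in the resampled data is therefore a replica: the first line below gives the positions of the replicas, the second their labels.

    oversampled_indices = list(range(len(y_train), len(y_tr_resampled)))
    oversampled_instances = y_tr_resampled[len(y_train):]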