python machine-learning nlp valueerror smote

How to remove minority classes with fewer than a certain number of examples before performing SMOTE in Python


I have a dataset with 100 feature columns (100-dimensional feature vectors generated by word2vec), and my target is a categorical variable with one label per row. There are around 1000 distinct classes in total and about 75,000 rows. The dataset is highly imbalanced: apart from the top 200 classes, all the remaining classes have very few samples, and some have fewer than 6.

Now I want to oversample this data with SMOTE to generate more examples for the minority classes. I want to ignore the classes that have fewer than 6 samples, because that is the point at which SMOTE raises a ValueError. Is there a way to handle this in code so that those classes are ignored when performing SMOTE? And will doing that solve the error I am currently facing?

Code & Error message for reference:

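import pandas as pd
from imblearn.over_sampling import SMOTE

dataset = pd.read_csv(r'C:\vectors.csv')
X = dataset.iloc[:, 3:103]  # the 100 word2vec feature columns
y = dataset.iloc[:, 0]      # the class label

smote = SMOTE(k_neighbors=1)
# fit_sample was removed in imbalanced-learn 0.8; fit_resample replaces it
smote_Xtrain, smote_y_train = smote.fit_resample(X, y)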

I am currently getting this error: ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2, even though I have set k_neighbors = 1.

Any help on this will be highly appreciated.


Solution

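  • You can count the entries for each class with the following command: df['VARIABLE'].value_counts(dropna=False) (leave dropna at its default of True if you don't want NaN to appear as a class).

    With those counts, you can set a threshold and either remove every class that appears fewer times than the threshold or merge those classes into one new class, "Other" for example. Removing them should also fix the error: as far as I know, SMOTE looks up k_neighbors + 1 neighbours inside each class (the sample itself is counted), so a class with a single sample fails even with k_neighbors = 1, and the default k_neighbors = 5 needs at least 6 samples per class.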
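
The threshold approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the helper names drop_rare_classes and merge_rare_classes, the min_samples parameter and the toy data are all made up for illustration, and the default of 6 simply mirrors the point at which the asker saw SMOTE fail.

```python
import pandas as pd


def drop_rare_classes(X, y, min_samples=6):
    """Keep only the rows whose class occurs at least `min_samples` times.

    Hypothetical helper. Rows with a NaN label are dropped as well,
    because value_counts() ignores NaN by default.
    """
    counts = y.value_counts()
    frequent = counts[counts >= min_samples].index
    keep = y.isin(frequent)
    return X.loc[keep], y.loc[keep]


def merge_rare_classes(y, min_samples=6, other_label="Other"):
    """Relabel every class with fewer than `min_samples` rows as `other_label`.

    Hypothetical helper; assumes string-like labels.
    """
    counts = y.value_counts()
    rare = counts[counts < min_samples].index
    # where() keeps the values for which the condition is True
    return y.where(~y.isin(rare), other_label)


# Toy data (invented): class "a" has 6 rows, "b" has 2, "c" has 1.
y = pd.Series(["a"] * 6 + ["b"] * 2 + ["c"], name="label")
X = pd.DataFrame({"f0": range(len(y)), "f1": range(len(y))})

X_kept, y_kept = drop_rare_classes(X, y, min_samples=6)
y_merged = merge_rare_classes(y, min_samples=6)
```

The filtered X_kept and y_kept can then be passed to SMOTE().fit_resample(X_kept, y_kept). If you lower k_neighbors, a threshold of k_neighbors + 1 should be enough. Note that with the merging variant, the "Other" class can itself end up below the threshold, so it is worth checking its count before resampling.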