python machine-learning scikit-learn oversampling

Oversampling for text classification in Python?


I have a data frame of text that I want to classify, but I need to oversample the minority class first. Sample data:

import pandas as pd

# 10 'Positive' rows and 4 'Negative' rows
df = pd.DataFrame({
    'Features': ['I am going to class today'] * 10
                + ['I am not going to class today'] * 4,
    'Class': ['Positive'] * 10 + ['Negative'] * 4,
})
df
    Features                        Class
0   I am going to class today       Positive
1   I am going to class today       Positive
2   I am going to class today       Positive
3   I am going to class today       Positive
4   I am going to class today       Positive
5   I am going to class today       Positive
6   I am going to class today       Positive
7   I am going to class today       Positive
8   I am going to class today       Positive
9   I am going to class today       Positive
10  I am not going to class today   Negative
11  I am not going to class today   Negative
12  I am not going to class today   Negative
13  I am not going to class today   Negative

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_over, y_over = oversample.fit_resample(df['Features'], df['Class'])
# summarize class distribution
print(Counter(y_over))

But this fails with ValueError: Expected 2D array, got 1D array instead. How can I oversample this data?


Solution

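  • I found the problem. fit_resample expects X to be 2D, with shape (n_samples, n_features), but df['Features'] is a 1D Series. I needed to reshape it into a single column: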

    X_over, y_over = oversample.fit_resample(df['Features'].values.reshape(-1,1), df['Class'])
    

    This works now. Printing Counter(y_over) shows both classes balanced:

    Counter({'Positive': 10, 'Negative': 10})
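
For intuition, here is a minimal pure-Python sketch of what random oversampling does to this data. It is not imblearn's implementation: random_oversample and its seed parameter are hypothetical names made up for illustration, and it balances every class up to the majority count, which for a two-class problem like this one matches what sampling_strategy='minority' produces. The list of one-element rows plays the same role as reshape(-1, 1).

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Illustrative helper (hypothetical, not part of imblearn).

    Appends randomly chosen copies of existing rows, drawn with
    replacement, until every class has as many rows as the largest class.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # positions of the original rows that carry this label
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


texts = (['I am going to class today'] * 10
         + ['I am not going to class today'] * 4)
labels = ['Positive'] * 10 + ['Negative'] * 4

# One single-element row per sample: 14 rows x 1 column,
# the list-of-lists analogue of .values.reshape(-1, 1).
X = [[t] for t in texts]

X_over, y_over = random_oversample(X, labels)
print(Counter(y_over))  # Counter({'Positive': 10, 'Negative': 10})
```

Note that the added rows are exact duplicates of existing minority rows, so no new text is generated. The original 14 rows stay at the front of X_over, followed by the 6 duplicated 'Negative' rows.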