ValueError encountered in SMOTE (imblearn.over_sampling)

I have been trying to oversample my dataset because it is imbalanced. I am doing binary text classification and would like a 1:1 ratio between my two classes. I am using SMOTE to do this.

I followed this tutorial: https://beckernick.github.io/oversampling-modeling/

However, I encounter an error which says:

ValueError: could not convert string to float

Here is my code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score
from imblearn.over_sampling import SMOTE

data = pd.read_csv("dataset.csv")

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):

    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values

    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(train_text, train_y)

    nb_pipeline.fit(train_text, train_y)
    predictions = nb_pipeline.predict(test_text)

    nb_conf_mat += confusion_matrix(test_y, predictions)
    score1 = f1_score(test_y, predictions)
    nb_f1_scores.append(score1)

print("F1 Score: ", sum(nb_f1_scores)/len(nb_f1_scores))
print("Confusion Matrix: ")
print(nb_conf_mat)

Can anyone tell me where I am going wrong? Without the two SMOTE lines, my program works fine.

Solution

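SMOTE creates new minority samples by interpolating between numeric feature vectors, so it cannot work on raw strings; passing it the sentences directly is what raises the ValueError. You should oversample after vectorizing the text data but before fitting the classifier. This means splitting up the pipeline so that the vectorizing steps and the classifier are fitted separately. The relevant part of the code should be something like this: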

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer())
])

k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):

    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values

    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

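    # Fit the vectorizer and TF-IDF on the training fold only, so SMOTE gets numeric features
    vectorized_text = nb_pipeline.fit_transform(train_text)

    # Oversample the vectorized training data, never the test fold
    # (newer imbalanced-learn releases spell these sampling_strategy and fit_resample)
    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(vectorized_text, train_y)

    # Train on the resampled data, then predict on the transformed (not resampled) test text
    clf = MultinomialNB()
    clf.fit(train_text_res, train_y_res)
    predictions = clf.predict(nb_pipeline.transform(test_text))

    # Accumulate the metrics as in the question's loop
    nb_conf_mat += confusion_matrix(test_y, predictions)
    nb_f1_scores.append(f1_score(test_y, predictions))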