I have been trying to oversample my dataset because it is imbalanced. I am doing binary text classification and would like a 1:1 ratio between the two classes. I am trying to use SMOTE to achieve this.
I followed this tutorial: https://beckernick.github.io/oversampling-modeling/
However, I encounter an error which says:
ValueError: could not convert string to float
Here is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score
from imblearn.over_sampling import SMOTE
data = pd.read_csv("dataset.csv")
nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values
    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values
    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(train_text, train_y)
    nb_pipeline.fit(train_text, train_y)
    predictions = nb_pipeline.predict(test_text)
    nb_conf_mat += confusion_matrix(test_y, predictions)
    score1 = f1_score(test_y, predictions)
    nb_f1_scores.append(score1)
print("F1 Score: ", sum(nb_f1_scores)/len(nb_f1_scores))
print("Confusion Matrix: ")
print(nb_conf_mat)
Can anyone tell me where I am going wrong? Without the two SMOTE lines, my program works fine.
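SMOTE creates synthetic samples by interpolating between numeric feature vectors, so it cannot work on raw strings; passing it the unvectorized sentences is what raises the ValueError. You should oversample after vectorizing the text data but before fitting the classifier. This means splitting up the pipeline in the code, because a scikit-learn Pipeline cannot hold a sampler as an intermediate step. The relevant part of the code should be something like this: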
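# Vectorization only; the classifier is fitted separately, after oversampling
nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer())
])
k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values
    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values
    # Turn the training text into numeric TF-IDF features first
    vectorized_text = nb_pipeline.fit_transform(train_text)
    # Oversample the numeric features, not the raw strings
    # (newer imbalanced-learn releases use sampling_strategy instead of ratio
    # and fit_resample instead of fit_sample)
    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(vectorized_text, train_y)
    # Fit on the resampled training data; the test fold is never oversampled
    clf = MultinomialNB()
    clf.fit(train_text_res, train_y_res)
    predictions = clf.predict(nb_pipeline.transform(test_text))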