The last confusion matrix is for the test set. Is this a case of overfitting with logistic regression? Even without much text pre-processing (emoticons and punctuation are left in), the accuracy is still very good. Can anyone give some help/advice?
You are fitting the TfidfVectorizer on the whole dataset before train_test_split, which may be a reason for the inflated performance due to "data leakage". Since the TfidfVectorizer learns its vocabulary from the whole dataset, it is including out-of-bag words (words that appear only in the test set) and computing tf-idf scores using statistics from the test data as well. Try the following:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()  # your vectorizer from the question

tweets_train, tweets_test, y_train, y_test = train_test_split(
    reviews['text'].tolist(),
    reviews['airline_sentiment'],
    test_size=0.3,
    random_state=42)

# Learn vocabulary and IDF weights from the training split only,
# then apply that same (already fitted) transform to the test split
X_train = v.fit_transform(tweets_train)
X_test = v.transform(tweets_test)
And then check the performance.
Note: Data leakage may not be the only reason for the high performance. It may also simply be that the dataset is one where plain tf-idf features work well.
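To see the leakage concretely, here is a minimal sketch (with made-up toy sentences, not your airline data) showing that fitting the vectorizer on train + test together puts test-only words into the learned vocabulary, while fitting on the training split alone does not:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good flight", "bad service"]
test_docs = ["terrible delay"]  # words never seen in training

# Leaky: vocabulary (and IDF stats) learned from train and test together
leaky = TfidfVectorizer().fit(train_docs + test_docs)
print("terrible" in leaky.vocabulary_)  # True -> test-only word leaked in

# Correct: vocabulary learned from the training split only
clean = TfidfVectorizer().fit(train_docs)
print("terrible" in clean.vocabulary_)  # False
```

The same reasoning applies to the IDF weights themselves: with the leaky fit, document frequencies are counted over the test documents too, so even shared words get slightly different scores.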