Tags: scikit-learn, nlp, logistic-regression, tf-idf

Why is this TF-IDF sentiment analysis classifier performing so well?


Jupyter Notebook

The last confusion matrix is for the test set. Is this a case of overfitting with logistic regression? Even without much text pre-processing (emoticons and punctuation are left in), the accuracy is still very good. Could anyone give some help/advice?


Solution

  • You are fitting the TfidfVectorizer on the whole dataset before train_test_split, which may be one reason for the inflated performance due to "data leakage". Since the TfidfVectorizer learns its vocabulary from all of your data, it is:

    • including words in the vocabulary that are absent from the training set and appear only in the test set (out-of-vocabulary words)
    • adjusting the tf-idf scores using document frequencies computed from the test documents as well, as the small sketch below demonstrates
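
    To see the leak concretely, here is a minimal sketch with a hypothetical toy corpus (the texts are made up for illustration); fitting on everything pulls test-only words into the vocabulary, and the IDF weights shift along with them:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Hypothetical toy corpus: the last tweet plays the role of test data
    train_texts = ["flight was great", "flight was delayed"]
    test_texts = ["terrible flight"]
    
    # Fitting on everything lets the test-only word "terrible" into the
    # vocabulary, and the IDF statistics also count the test document
    leaky = TfidfVectorizer().fit(train_texts + test_texts)
    clean = TfidfVectorizer().fit(train_texts)
    
    print(sorted(leaky.vocabulary_))  # includes 'terrible'
    print(sorted(clean.vocabulary_))  # does not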

    Try the following:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Split the raw text first, before any vectorization
    tweets_train, tweets_test, y_train, y_test = train_test_split(
        reviews['text'].tolist(),
        reviews['airline_sentiment'],
        test_size=0.3,
        random_state=42)
    
    # Fit on the training text only; apply the learned
    # vocabulary and IDF weights to the test text
    v = TfidfVectorizer()
    X_train = v.fit_transform(tweets_train)
    X_test = v.transform(tweets_test)


    And then check the performance.
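
    For the check itself, something along these lines works (assuming the same LogisticRegression classifier as in your notebook; the variable names come from the snippet above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix
    
    # Train on the leak-free features and evaluate on the held-out set
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    
    y_pred = clf.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))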

    Note: data leakage may not be the only reason for the high performance. It could also be that the dataset is simply one where plain tf-idf features work well.
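
    One leak-proof way to test that is to wrap the vectorizer and classifier in a Pipeline and cross-validate it; each fold then re-fits the vectorizer on that fold's training split only. A minimal sketch, assuming the same reviews DataFrame from the question:

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    
    # The vectorizer is fitted inside each CV fold, so no test-fold
    # vocabulary or IDF statistics can leak into training
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, reviews['text'], reviews['airline_sentiment'], cv=5)
    print(scores.mean(), scores.std())

    If the cross-validated accuracy stays high, simple tf-idf really does work well for this data; if it drops noticeably, the leakage was doing the heavy lifting.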