Tags: scikit-learn, nlp, logistic-regression, tf-idf

Why is this TF-IDF sentiment analysis classifier performing so well?


Jupyter Notebook

The last confusion matrix is for the test set. Is this a case of overfitting with logistic regression? Even without much text pre-processing (emoticons and punctuation are left in), the accuracy is still very good. Could anyone give some help/advice?


Solution

  • You are fitting the TfidfVectorizer on the whole dataset before train_test_split, which may be one reason for the inflated performance due to "data leakage". Since the TfidfVectorizer learns its vocabulary from all of your data, it is:

    • including words in the vocabulary that are absent from the training set and appear only in the test set (out-of-vocabulary words)
    • adjusting the tf-idf scores using document frequencies computed from the test documents as well, as the small sketch below demonstrates
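
    To see the leak concretely, here is a minimal sketch with a hypothetical toy corpus (the texts are made up for illustration); fitting on everything pulls test-only words into the vocabulary, and the IDF weights shift along with them:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Hypothetical toy corpus: the last tweet plays the role of test data
    train_texts = ["flight was great", "flight was delayed"]
    test_texts = ["terrible flight"]
    
    # Fitting on everything lets the test-only word "terrible" into the
    # vocabulary, and the IDF statistics also count the test document
    leaky = TfidfVectorizer().fit(train_texts + test_texts)
    clean = TfidfVectorizer().fit(train_texts)
    
    print(sorted(leaky.vocabulary_))  # includes 'terrible'
    print(sorted(clean.vocabulary_))  # does not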

    Try the following:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Split the raw text first, before any vectorization
    tweets_train, tweets_test, y_train, y_test = train_test_split(
        reviews['text'].tolist(),
        reviews['airline_sentiment'],
        test_size=0.3,
        random_state=42)
    
    # Fit on the training text only; apply the learned
    # vocabulary and IDF weights to the test text
    v = TfidfVectorizer()
    X_train = v.fit_transform(tweets_train)
    X_test = v.transform(tweets_test)


    And then check the performance.
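
    For the check itself, something along these lines works (assuming the same LogisticRegression classifier as in your notebook; the variable names come from the snippet above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix
    
    # Train on the leak-free features and evaluate on the held-out set
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    
    y_pred = clf.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))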

    Note: data leakage may not be the only reason for the high performance. It could also be that the dataset is simply one where plain tf-idf features work well.
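
    One leak-proof way to test that is to wrap the vectorizer and classifier in a Pipeline and cross-validate it; each fold then re-fits the vectorizer on that fold's training split only. A minimal sketch, assuming the same reviews DataFrame from the question:

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    
    # The vectorizer is fitted inside each CV fold, so no test-fold
    # vocabulary or IDF statistics can leak into training
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, reviews['text'], reviews['airline_sentiment'], cv=5)
    print(scores.mean(), scores.std())

    If the cross-validated accuracy stays high, simple tf-idf really does work well for this data; if it drops noticeably, the leakage was doing the heavy lifting.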