Tags: data-science, text-classification, tfidfvectorizer

Excellent performance on training set, bad on test set


I'm doing text classification and I'm getting strange results. I have two datasets, one labeled and one unlabeled. When I use several classifiers (SVM, Naive Bayes, kNN, Random Forest, Gradient Boosting) on the labeled one, I get excellent performance with all of them even without tuning (more than 98% balanced accuracy, BAC), but when I try to predict on the unlabeled dataset I get very different predictions from every classifier. I used TF-IDF as the vectorizer and also tried bigrams and trigrams, but nothing changed. I also tried creating new observations with SMOTE (even though my dataset isn't imbalanced), just to see whether the algorithms would generalize better to new data with more observations, but again nothing changed. What can I do to resolve this problem? Why is this happening? Do you have any ideas?


Solution

  • Hi and welcome to the forum!

    I can think of 3 possibilities (which are in fact slightly overlapping):

    1. Are you splitting the labeled dataset into a training and a validation set? Maybe you are suffering from the horrible-sounding data leakage. Basically, data from the validation set is somehow leaking into the training data, so the model knows more than it should. It's more common than you think. (See the first sketch after this list for a leakage-safe way to fit the TF-IDF vectorizer.)

    2. Maybe you are overfitting the training set. Basically, the model memorizes the training data and doesn't generalize well to new data. You can try stopping the training at an earlier stage (see the early-stopping sketch after this list).

    3. The distributions of the training data and the test data are not similar enough for the model to generalize well. You can try reshuffling them and splitting again. A basic way to gauge the similarity of the datasets is to compare the distribution of classes in the training and test data (see the last sketch after this list), but more complex and useful techniques exist.
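
    For point 1, here is a minimal sketch of a leakage-safe setup (the `text`/`label` column names and the CSV file are placeholders for your own data): the TF-IDF vectorizer sits inside a pipeline, so it is fitted on the training fold only and the validation fold never influences the vocabulary or the IDF weights.

    ```python
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    df = pd.read_csv("labeled.csv")        # placeholder: your labeled dataset
    texts, labels = df["text"], df["label"]

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),      # fitted inside each CV fold only
        ("clf", LinearSVC()),
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="balanced_accuracy")
    print("balanced accuracy per fold:", scores)
    ```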
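
    For point 2, since you mention Gradient Boosting, this is roughly what early stopping can look like in scikit-learn; the parameter values are arbitrary, and `X_train`/`y_train` stand for your vectorized training data.

    ```python
    from sklearn.ensemble import GradientBoostingClassifier

    gbc = GradientBoostingClassifier(
        n_estimators=1000,        # upper bound on boosting rounds
        validation_fraction=0.1,  # internal hold-out used for early stopping
        n_iter_no_change=10,      # stop after 10 rounds without improvement
        tol=1e-4,
        random_state=0,
    )
    gbc.fit(X_train, y_train)     # X_train/y_train: your vectorized training data
    print("boosting rounds actually used:", gbc.n_estimators_)
    ```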
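
    And for point 3, a quick way to compare class proportions after re-splitting (reusing `texts`/`labels` from the first sketch); using `stratify` keeps the class distribution the same in both splits.

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )
    print(pd.Series(y_train).value_counts(normalize=True))
    print(pd.Series(y_val).value_counts(normalize=True))
    ```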

    Some more complex techniques to compare the training and the test data are:

    • Checking whether a classifier can correctly decide whether a datapoint belongs to the train or the test set. If the distributions are similar enough, this shouldn't be possible. Here's a tutorial in Python. (See the first sketch below this list.)

    • Using the Kolmogorov-Smirnov test (another tutorial in Python). scipy.stats implements it: see stats.ks_2samp. Beware: this test must be applied to each column separately, so it doesn't work directly on multi-dimensional data such as NLP word embeddings (see the second sketch below this list).

    • If you are indeed working with word embeddings, you should use the classifier described in the first bullet point or transform your data so it is one-dimensional. A simple example is to compute the norm of the word embeddings, but that doesn't quite do the job. The Mahalanobis distance (also implemented in SciPy) works a bit better; see the bottom of the last bullet point's tutorial for details, and the last sketch below this list.
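
    A rough sketch of the train-vs-test classifier from the first bullet above; `texts_labeled` and `texts_unlabeled` are placeholders for your two corpora. An ROC AUC close to 0.5 means the classifier cannot tell the two sets apart, i.e. the distributions look similar; an AUC close to 1.0 means they differ noticeably.

    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    # 0 = came from the labeled set, 1 = came from the unlabeled set
    all_texts = list(texts_labeled) + list(texts_unlabeled)
    origin = np.array([0] * len(texts_labeled) + [1] * len(texts_unlabeled))

    adv = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    auc = cross_val_score(adv, all_texts, origin, cv=5, scoring="roc_auc")
    print("train-vs-test ROC AUC:", auc.mean())
    ```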
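
    A sketch of the per-column Kolmogorov-Smirnov check with `stats.ks_2samp`; `X_labeled` and `X_unlabeled` are placeholders for two numeric feature matrices with the same columns.

    ```python
    from scipy import stats

    # run the two-sample KS test on every column separately
    for j in range(X_labeled.shape[1]):
        statistic, p_value = stats.ks_2samp(X_labeled[:, j], X_unlabeled[:, j])
        if p_value < 0.01:
            print(f"column {j}: distributions differ "
                  f"(KS={statistic:.3f}, p={p_value:.3g})")
    ```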
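
    Finally, a sketch of the Mahalanobis idea: collapse each embedding to a single number (its distance from the training distribution) so that a one-dimensional test such as `ks_2samp` can be applied; `emb_train` and `emb_test` are placeholder `(n, d)` arrays of embeddings.

    ```python
    import numpy as np
    from scipy import stats
    from scipy.spatial.distance import mahalanobis

    mean = emb_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(emb_train, rowvar=False))   # inverse covariance

    d_train = np.array([mahalanobis(e, mean, cov_inv) for e in emb_train])
    d_test = np.array([mahalanobis(e, mean, cov_inv) for e in emb_test])

    # the distances are one-dimensional, so the KS test applies again
    print(stats.ks_2samp(d_train, d_test))
    ```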