machine-learning scikit-learn text-classification word-count

Can I use CountVectorizer on both test and train data at the same time, or do I need to split it up?

I currently have an SVM model that classifies text into two different classes. I'm using CountVectorizer and TfidfTransformer to create my "word vector."

The thing is, I think I may be doing it in the wrong order: I convert all the text first and only then split it up.

My question is, will there be any difference if I do train_test_split first and then do the fit_transform only on the train data and then transform on the test data?

What is the correct way to do it?

Big thanks in advance, happy coding!

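from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Vectorize all of the text (textList) before splitting
count_vect = CountVectorizer(stop_words='english')
X_counts = count_vect.fit_transform(textList)

tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

# Split only after both transformers have been fitted on everything
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, correctLabels, test_size=.33, random_state=17)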

Solution

First split the data into a train and a test set, then fit the transformers only on the train set and use them to transform the test set.

If you do it the other way around, you are leaking information from the test set into the train set. The test score then becomes overly optimistic and can hide overfitting, so you will not notice if your model fails to generalize to new, unseen data.
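As a rough sketch of the split-first order, the snippet below reuses the names from the question. The toy `textList` and `correctLabels`, the `stratify` argument, and the choice of `SVC(kernel="linear")` are assumptions made only so the example runs on its own; substitute your own data and model.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data standing in for the question's textList / correctLabels
textList = [
    "cheap pills buy now", "limited offer buy cheap", "win money now",
    "meeting agenda for monday", "project status report", "lunch on monday",
]
correctLabels = [1, 1, 1, 0, 0, 0]

# 1. Split the raw text first, before any vectorizing
text_train, text_test, y_train, y_test = train_test_split(
    textList, correctLabels, test_size=.33, random_state=17,
    stratify=correctLabels,  # assumption: keeps both classes in the tiny train set
)

# 2. Fit the transformers on the training text only
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(text_train)

tfidf_transformer = TfidfTransformer()
X_train = tfidf_transformer.fit_transform(X_train_counts)

# 3. Only transform (never fit) the test text with the fitted transformers
X_test_counts = count_vect.transform(text_test)
X_test = tfidf_transformer.transform(X_test_counts)

# 4. Train on the train features, evaluate on the untouched test features
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

If you prefer not to manage the steps by hand, `sklearn.pipeline.Pipeline` can bundle the vectorizer, the transformer and the classifier so that calling `fit` on the pipeline only ever sees the training text.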

The goal of a test set is to test how well your model performs on new data. In text analytics, this may mean words the model has never seen before and knows nothing about the importance of, or new distributions of word occurrences. If you fit your CountVectorizer and TfidfTransformer on all of the data first, you will have no idea how the model responds to this: after all, the transformers have already seen all the data. The problem: you think you have built a great model with great performance, but once it is put in production, the accuracy may be much lower.
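To make the "words it has never seen before" point concrete, here is a small illustration. The documents are made up for this example. A vectorizer fitted on the train text alone ignores an unseen test word, while one fitted on everything has already learned it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, invented for this illustration
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the parrot sat"]

# Fitted on train only: "parrot" is not in the vocabulary
vect = CountVectorizer()
vect.fit(train_docs)
row = vect.transform(test_docs).toarray()[0]
counts = {word: int(row[idx]) for word, idx in vect.vocabulary_.items()}

print("parrot" in vect.vocabulary_)   # False: unseen at fit time
print(counts)                         # only "the" and "sat" are counted

# Fitted on train + test (the leaky order): "parrot" is already known
leaky_vect = CountVectorizer()
leaky_vect.fit(train_docs + test_docs)
print("parrot" in leaky_vect.vocabulary_)  # True: test information leaked in
```

In the first case the test document is handled the way a genuinely new document would be in production; in the second, the feature space was already shaped by the test text.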