python nlp data-science text-processing train-test-split

Do you have to clean your test data before feeding into an NLP model?

This is a natural language processing related question.

Suppose I have a labelled train set and an unlabelled test set. After cleaning my train data (stopword removal, stemming, punctuation removal, etc.), I use the cleaned data to build my model.

When applying the model to my test data, do I also have to clean the test text in the same manner as I did with my train set? Or should I not touch the test data at all?

Thanks!

Solution

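Yes, you should apply exactly the same preprocessing to your test set as you did to your training set. The model only knows the cleaned, stemmed tokens it saw during training, so raw test text would produce tokens (capitalised words, unstemmed forms, words with punctuation attached) that it cannot match. Anything that is learned from the data, such as a vocabulary or TF-IDF weights, should be fitted on the training set only and then reused unchanged on the test set.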
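As a rough illustration, here is a minimal sketch in plain Python. The `clean`, `build_vocab` and `vectorize` helpers, the tiny stopword list and the "strip a trailing s" rule are all hypothetical stand-ins for whatever cleaning and stemming you actually use (e.g. a real stemmer and a full stopword list). The point is only that one cleaning function is shared by both sets, and the vocabulary is learned from the training set alone.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "to", "of"}

def clean(text):
    """Lowercase, strip punctuation, drop stopwords, crudely stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive stand-in for a real stemmer: trim a trailing "s" from longer words.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_vocab(docs):
    """Learn the vocabulary from the cleaned TRAINING documents only."""
    vocab = {}
    for doc in docs:
        for tok in clean(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(doc, vocab):
    """Bag-of-words counts; tokens not seen in training are ignored."""
    vec = [0] * len(vocab)
    for tok in clean(doc):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

train_texts = ["The movies were great!", "The plot was boring."]
test_texts = ["A great movie, great plots."]

vocab = build_vocab(train_texts)                            # fitted on train only
X_train = [vectorize(d, vocab) for d in train_texts]
X_test = [vectorize(d, vocab) for d in test_texts]          # same clean(), same vocab
```

Because the test sentence goes through the same `clean` step, "movie" and "plots" line up with the "movies" and "plot" features learned from the training data. If you skipped the cleaning on the test set, those tokens would not match the training vocabulary and would simply be dropped.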