python nlp data-science text-processing train-test-split

Do you have to clean your test data before feeding it into an NLP model?


This is a natural language processing related question.

Suppose I have a labelled training set and an unlabelled test set. After cleaning my training data (stopword removal, stemming, punctuation removal, etc.), I use the cleaned data to build my model.

When applying the model to my test data, do I also have to clean the test text in the same manner as I did with my training set? Or should I not touch the test data at all?

Thanks!


Solution

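  • Yes, you should apply exactly the same preprocessing to your test data as to your training data. The model only knows the cleaned tokens it saw during training, so raw test text (with its punctuation, capitalisation, and unstemmed forms) would produce tokens that do not match what the model learned.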
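
As a rough illustration of the point above, here is a minimal sketch using only the Python standard library. The tiny stopword list, the function names (clean_text, build_vocabulary, vectorize) and the sample sentences are all made up for this example; in practice you would probably use a library tokenizer, stemmer and vectorizer instead. The idea is simply that one cleaning function is reused for both sets, and the vocabulary is built from the training data only.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "this", "and"}


def clean_text(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocabulary(cleaned_docs):
    """Map each token seen in the training data to a column index."""
    vocab = {}
    for tokens in cleaned_docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(tokens, vocab):
    """Bag-of-words counts; tokens outside the training vocabulary are ignored."""
    counts = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            counts[vocab[tok]] += 1
    return counts


# Made-up sample data.
train_texts = ["This movie was GREAT!", "The plot is boring."]
test_texts = ["A great, great plot!"]

# Clean the training data and learn the vocabulary from it alone.
train_clean = [clean_text(t) for t in train_texts]
vocab = build_vocabulary(train_clean)

# Reuse the SAME cleaning function on the test data.
test_vectors = [vectorize(clean_text(t), vocab) for t in test_texts]

# For comparison: the same test sentence without cleaning.
raw_vector = vectorize(test_texts[0].split(), vocab)

print(vocab)
print(test_vectors)
print(raw_vector)
```

With cleaning, both occurrences of "great" and the word "plot" in the test sentence line up with the training vocabulary. Without it, tokens such as "great," and "plot!" still carry punctuation and are silently dropped, so the model would see less of the input than it should.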