This is a natural language processing related question.
Suppose I have a labelled training set and an unlabelled test set. After cleaning my training data (stopword removal, stemming, punctuation removal, etc.), I use the cleaned data to build my model.
When applying the model to my test data, do I also have to clean the test text in the same way as I did with my training set, or should I not touch the test data at all?
Thanks!
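Yes, you should apply exactly the same preprocessing to your test set as to your training set. The model's features (for example, its vocabulary) were learned from cleaned text, so raw test text would contain tokens such as stopwords, unstemmed forms, or punctuation-attached words that the model never saw during training.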
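
As a rough illustration, here is a minimal sketch in plain Python. The stopword list, the example sentences, and the helper names (`clean_text`, `build_vocab`, `vectorize`) are made up for this example, and stemming is left out for brevity. The point is only that one cleaning function is reused for both sets, and that the vocabulary is built from the training data alone.

```python
import string

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "this", "and", "of"}


def clean_text(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocab(token_lists):
    """Map each distinct token to a column index (sorted for stability)."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(tokens, vocab):
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


# Made-up example data.
train_texts = ["This is a great movie!", "The plot is terrible."]
test_texts = ["A GREAT plot, and great acting."]

# The same cleaning function is applied to both sets.
train_tokens = [clean_text(t) for t in train_texts]
test_tokens = [clean_text(t) for t in test_texts]

# The vocabulary comes from the training data only.
vocab = build_vocab(train_tokens)
X_train = [vectorize(t, vocab) for t in train_tokens]
X_test = [vectorize(t, vocab) for t in test_tokens]
```

If the test text were not cleaned, tokens like "GREAT" or "plot," would not match the vocabulary entries "great" and "plot", and the model would effectively see an empty document.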