Tags: python, nlp, text-mining, sentiment-analysis

Should text-mining preprocessing be applied to the test set, the training set, or both?


I'm working on some text-mining tasks and have a simple question that I still can't settle.

I am applying preprocessing, such as tokenization and stemming, to my training set so I can train my model.

Should I also apply this pre-processing to my test set?


Solution

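  • Yes, you should apply the same preprocessing to your test set. The test set must be representative of the training set, so both should come from the same distribution. If the model was trained on tokenized, stemmed text but is evaluated on raw text, many test tokens will not match anything it learned. Intuitively: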

    Suppose you are about to take an exam. For you to prepare properly and get a fair result, the lecturer should ask questions on the subjects covered in the lectures. If the lecturer instead asks about completely different subjects that no one has seen, a fair result is not possible.
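To make this concrete, here is a minimal sketch using only the Python standard library. The `preprocess` and `known_fraction` functions and the example sentences are hypothetical, and the suffix stripping is a toy stand-in for a real stemmer, not something from the question. It builds a vocabulary from the preprocessed training set, then compares how much of a test sentence the vocabulary covers with and without the same preprocessing.

```python
import re


def preprocess(text):
    """Lowercase, tokenize on letters, and strip a few common suffixes.

    The suffix stripping is a toy stand-in for a real stemmer.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = []
    for tok in tokens:
        for suffix in ("ing", "ed", "s"):
            # Only strip when at least three characters remain.
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems


def known_fraction(tokens, vocab):
    """Fraction of tokens that appear in the training vocabulary."""
    return sum(t in vocab for t in tokens) / len(tokens)


# Hypothetical toy data.
train_docs = ["I loved the acting", "Boring plots bored me"]
test_doc = "Loving the plots"

# The vocabulary is learned from the preprocessed training set only.
vocab = {tok for doc in train_docs for tok in preprocess(doc)}

raw_tokens = test_doc.split()        # test set NOT preprocessed
clean_tokens = preprocess(test_doc)  # same preprocessing as training

print("raw coverage:  ", known_fraction(raw_tokens, vocab))
print("clean coverage:", known_fraction(clean_tokens, vocab))
```

With the raw test sentence, only "the" matches the training vocabulary; with the same preprocessing applied, every token matches. Note that anything learned from data (here, the vocabulary) is built from the training set only and then reused unchanged on the test set.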