python, python-3.x, scikit-learn, countvectorizer

Should CountVectorizer be fit on both the train and test sets?


I have come across various articles online. Some suggest that CountVectorizer should be fit on both the train and test sets, while others suggest that it should be fit only on the train set. Which approach is generally better for text classification?


Solution

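  • Generally, the test set should remain unseen until evaluation, so CountVectorizer should be fit only on the train set. The test set is then converted with transform(), which reuses the vocabulary learned from the train set. Fitting on both sets would leak information about the test set into the features.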
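
As a rough illustration of the point above, here is a minimal sketch of fitting on the train set only. The two tiny text lists and the variable names are made up for the example and are not from the question; the default CountVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented purely for illustration.
train_texts = ["the cat sat", "the dog barked"]
test_texts = ["the bird sat"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the train set only, then vectorize it.
X_train = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary for the test set: transform(), not fit_transform().
X_test = vectorizer.transform(test_texts)

# Words that appear only in the test set ("bird" here) are not in the
# vocabulary, so they are simply ignored when the test set is transformed.
print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns, which is what a classifier trained on X_train needs when it later predicts on X_test. If cross-validation is involved, putting the vectorizer and the classifier into a scikit-learn Pipeline is one way to get the same fit-on-train-only behavior inside each fold.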