I have come across various articles online, some of which suggest that CountVectorizer should be fit on both the train and test sets, and some suggest that it should be fit only on the train set. Which approach is generally better for text classification?
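Generally, the test set should be kept unobserved, so the CountVectorizer should be fitted only on the training set. Fitting it on the test set as well lets information about the test data (its vocabulary) leak into the features, which can make the evaluation look better than it would be on truly unseen text. Once fitted, the same vectorizer is then used to transform the test set.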
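A minimal sketch of that workflow with scikit-learn, assuming two tiny made-up lists of documents (`train_texts` and `test_texts` are illustrative names, not from the question): call `fit_transform` on the training texts and only `transform` on the test texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, purely illustrative
train_texts = ["the cat sat on the mat", "the dog barked"]
test_texts = ["the cat barked at the bird"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training texts only, then vectorize them
X_train = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary; words not seen during fitting are ignored
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)  # both have the same number of columns
print("bird" in vectorizer.vocabulary_)
```

Words that appear only in the test texts (here "bird" and "at") are simply dropped, which mirrors what would happen with new text at prediction time. Putting the vectorizer and the classifier in a scikit-learn `Pipeline` should give the same behaviour during cross-validation, since the vectorizer is then refitted on each training fold.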