How to normalize TF*IDF or counts in scikit-learn?

I want to compute the cosine similarity of two documents of very different lengths (say one is a one- or two-liner while the other is 100-200 lines long).

I need a way to normalize the output of TfidfVectorizer or CountVectorizer in scikit-learn for this.

Solution

TfidfVectorizer has a norm parameter (see the docs) that deals with this issue. Its default is already 'l2', but you can set it explicitly. Try, for example, something like this:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', norm='l2')

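This will scale each document vector to unit Euclidean length, so a long document does not get a larger vector simply because it contains more words. With unit-length vectors, the dot product of two documents is their cosine similarity.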
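
As a rough illustration, here is a minimal sketch comparing a short and a long document. The two sample texts and the variable names are made up for the example. Since CountVectorizer has no norm parameter, the sketch assumes you would apply sklearn.preprocessing.normalize to the raw counts yourself.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Made-up documents: a one-liner and a much longer text.
short_doc = "Cats chase mice."
long_doc = " ".join(["Cats chase mice around the old barn every night."] * 50)

# TF-IDF with L2 normalisation: every row comes out with unit length.
vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', norm='l2')
tfidf = vectorizer.fit_transform([short_doc, long_doc])
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

# For unit-length rows the dot product is the cosine similarity.
dot_product = tfidf[0].multiply(tfidf[1]).sum()
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Raw counts are not normalised, so scale the rows explicitly.
counts = CountVectorizer(stop_words='english').fit_transform([short_doc, long_doc])
counts_l2 = normalize(counts, norm='l2')
count_norms = np.linalg.norm(counts_l2.toarray(), axis=1)

print(row_norms, dot_product, similarity, count_norms)
```

Note that cosine_similarity divides by the vector lengths itself, so it gives a length-independent score even on unnormalised counts. Normalising up front mainly matters if you want to use plain dot products.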