
How to normalize TF*IDF or counts in scikit-learn?


I want to compute the cosine similarity of two documents of very different lengths (say, one is a one- or two-liner while the other is 100-200 lines).

I need a way to normalize the output of TfidfVectorizer or CountVectorizer in scikit-learn for this.


Solution

  • TfidfVectorizer takes a norm parameter (see the docs) that deals with this issue. Its default value, norm='l2', scales each document vector to unit length. Try, for example, something like this:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', norm='l2')
    

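    This normalises each document vector to unit Euclidean length, so a long document does not get a larger vector than a short one simply because it contains more words. Cosine similarity is itself independent of vector length, so with L2-normalised vectors it reduces to a plain dot product.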
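
    As a rough end-to-end sketch of the above, the snippet below compares a short and a long document with both TF-IDF and raw counts. The two documents are made up for illustration, and using sklearn.preprocessing.normalize on the CountVectorizer output is one possible way to handle counts (CountVectorizer has no norm parameter of its own), not the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Hypothetical documents of very different lengths (illustration only).
short_doc = "The cat sat on the mat."
long_doc = " ".join(["The cat sat on the mat and watched the dog chase a ball."] * 50)
docs = [short_doc, long_doc]

# TF-IDF with L2 normalisation (norm='l2' is also the default).
tfidf = TfidfVectorizer(analyzer="word", stop_words="english", norm="l2")
X_tfidf = tfidf.fit_transform(docs)

# Every row of the TF-IDF matrix should now have Euclidean length 1.
tfidf_row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1))).ravel()

# Cosine similarity between the short and the long document.
tfidf_sim = cosine_similarity(X_tfidf[0], X_tfidf[1])[0, 0]

# Raw counts are not normalised by CountVectorizer, so scale the rows explicitly.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
X_counts = normalize(counts, norm="l2")

# On unit-length rows, the dot product is the cosine similarity.
count_sim = X_counts[0].multiply(X_counts[1]).sum()

print("TF-IDF row norms:", tfidf_row_norms)
print("TF-IDF cosine similarity:", tfidf_sim)
print("Count cosine similarity:", count_sim)
```

    Note that cosine_similarity normalises its inputs internally, so passing it the raw counts gives the same value as the explicit dot product on the normalised rows.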