
How to Select Top 1000 words using TF-IDF Vector?


I have a collection of 5000 reviews, stored in sample_data, and I am applying a TF-IDF vectorizer to it with unigrams only (ngram_range=(1,1)). Now I want the top 1000 words in sample_data, ranked by highest TF-IDF value. Could anyone tell me how to get these top words?

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)

Solution

  • TF-IDF values are computed per document, so a term has no single TF-IDF score. You can, however, keep the top 1000 terms ranked by their count (term frequency) across the corpus by using the max_features parameter of TfidfVectorizer:

    max_features : int or None, default=None

    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.
    

    Just do:

    tf_idf_vect = TfidfVectorizer(ngram_range=(1,1), max_features=1000)
    

    You can also get the idf values (global term weights) from tf_idf_vect after fitting it on the documents, through the idf_ attribute:

    idf_ : array, shape = [n_features], or None

      The learned idf vector (global term weights) when use_idf is set to True,  
    

    Do this after calling tf_idf_vect.fit(sample_data):

    idf = tf_idf_vect.idf_
    

    Then select the 1000 terms with the highest idf and re-fit on the data using only those selected features, for example by passing them through the vocabulary parameter of TfidfVectorizer.

    But you cannot get the top 1000 by "tf-idf" directly, because a tf-idf value is the product of the tf of a term in a single document and the (global) idf of that term. So, before normalization, a word that appears twice in one document has twice the tf-idf it has in another document where it appears only once. Which of these different values for the same term should be compared? Hope this makes it clear.
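
As a rough illustration of the points above, the sketch below uses a tiny made-up corpus in place of the real sample_data and k = 3 in place of 1000. It shows that the same word gets different tf-idf values in different documents, then ranks terms two ways: by idf, as suggested above, and by mean tf-idf across documents. The mean is only one possible way to collapse per-document scores into a single number per term (max or sum would work too); it is an assumption of this sketch, not something the answer prescribes. The names k, terms, and restricted_vect are likewise illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (hypothetical); replace with the real sample_data.
sample_data = [
    "good food good service",
    "good food but slow service",
    "terrible food",
    "slow delivery and cold food",
]
k = 3  # assumption for the toy corpus; use 1000 on the real data

# Fit once on the full vocabulary.
tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1))
final_tf_idf = tf_idf_vect.fit_transform(sample_data)  # sparse, (n_docs, n_terms)

# Terms ordered by column index, built from vocabulary_ (term -> column).
vocab = tf_idf_vect.vocabulary_
terms = np.array(sorted(vocab, key=vocab.get))

# The same term has a different tf-idf value in each document:
# "good" appears twice in document 0, once in document 1, never afterwards.
good_scores = final_tf_idf[:, vocab["good"]].toarray().ravel()

# Ranking 1: by idf (a single global weight per term).
idf = tf_idf_vect.idf_
top_by_idf = terms[np.argsort(idf)[::-1][:k]]

# Ranking 2 (assumption): by mean tf-idf over all documents.
mean_tfidf = np.asarray(final_tf_idf.mean(axis=0)).ravel()
top_by_mean = terms[np.argsort(mean_tfidf)[::-1][:k]]

# Re-fit using only the selected terms via the vocabulary parameter.
restricted_vect = TfidfVectorizer(ngram_range=(1, 1),
                                  vocabulary=top_by_mean.tolist())
restricted_tf_idf = restricted_vect.fit_transform(sample_data)

print("tf-idf of 'good' per document:", good_scores)
print("top by idf:", top_by_idf)
print("top by mean tf-idf:", top_by_mean)
print("restricted matrix shape:", restricted_tf_idf.shape)
```

On the real data, set k = 1000 and drop the toy corpus. Keep in mind that the highest idf values belong to the rarest terms, so the two rankings can pick quite different words.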