python-3.x scikit-learn tf-idf sklearn-pandas tfidfvectorizer

How to Select Top 1000 words using TF-IDF Vector?

I have a collection of 5000 reviews, stored in sample_data. I am applying a tf-idf vectorizer to sample_data with a unigram range (ngram_range=(1,1)). Now I want to get the top 1000 words from sample_data, meaning the ones with the highest tf-idf values. Could anyone tell me how to get the top words?

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)

Solution

TF-IDF values depend on individual documents, so a term does not have one single tf-idf value. You can, however, get the top 1000 terms by their count (tf) across the corpus by using the max_features parameter of TfidfVectorizer:

max_features : int or None, default=None

If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.

Just do:

tf_idf_vect = TfidfVectorizer(ngram_range=(1,1), max_features=1000)
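
As a rough illustration of what max_features does, here is a minimal sketch on a toy corpus. The three review strings and the value max_features=2 are made up for the example (the question uses 5000 reviews and 1000 features); the kept terms are read back from the fitted vocabulary_ attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the 5000 reviews in sample_data.
sample_data = [
    "the movie was good",
    "the movie was bad",
    "the plot was good but the acting was bad",
]

# max_features=2 only because the toy corpus is tiny; the answer uses 1000.
tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1), max_features=2)
final_tf_idf = tf_idf_vect.fit_transform(sample_data)

# vocabulary_ maps each kept term to its column index in final_tf_idf.
kept_terms = sorted(tf_idf_vect.vocabulary_, key=tf_idf_vect.vocabulary_.get)
print(kept_terms)
print(final_tf_idf.shape)
```

Only the terms with the highest total count across the corpus survive, so the resulting matrix has one column per kept term.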

You can also get the idf values (global term weights) from tf_idf_vect after fitting it on the documents, through the idf_ attribute:

idf_ : array, shape = [n_features], or None
  The learned idf vector (global term weights) when use_idf is set to True,  

Do this after calling tf_idf_vect.fit(sample_data):

idf = tf_idf_vect.idf_

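You can then select the 1000 terms with the highest idf values and re-fit the vectorizer on only those terms, for example by passing them as the vocabulary parameter. Keep in mind that a high idf means the term is rare in the corpus.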
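
Selecting by idf_ and re-fitting could look roughly like the sketch below. The toy corpus, the top_n value of 3, and the variable names are assumptions for illustration; restricting the second vectorizer through the vocabulary parameter is one way to do the re-fit, not necessarily the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the 5000 reviews in sample_data.
sample_data = [
    "the movie was good",
    "the movie was bad",
    "the plot was good but the acting was bad",
]
top_n = 3  # would be 1000 for the question's data

tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1))
tf_idf_vect.fit(sample_data)

# Terms in column order, so terms[i] lines up with idf[i].
terms = sorted(tf_idf_vect.vocabulary_, key=tf_idf_vect.vocabulary_.get)
idf = tf_idf_vect.idf_

# Indices of the top_n largest idf values, i.e. the rarest terms.
top_idx = np.argsort(idf)[::-1][:top_n]
selected_terms = [terms[i] for i in top_idx]

# Re-fit with the vocabulary restricted to the selected terms.
restricted_vect = TfidfVectorizer(ngram_range=(1, 1), vocabulary=selected_terms)
restricted_tf_idf = restricted_vect.fit_transform(sample_data)
print(selected_terms)
print(restricted_tf_idf.shape)
```

Documents that contain none of the selected terms end up as all-zero rows, which is worth checking before using the matrix downstream.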

But you cannot get the top 1000 by "tf-idf" directly, because tf-idf is the product of the tf of a term in a single document and the (global) idf of that term. Before normalization, a word that appears twice in one document will have twice the tf-idf it has in another document where it appears only once. How would you compare the different values of the same term? Hope this makes it clear.
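
To make the point concrete, the sketch below compares the tf-idf of one word in two documents. The two strings are invented for the example, and norm=None is an assumption made here to switch off the default L2 row normalization so that the raw tf * idf product is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus: "good" appears twice, then once.
docs = ["good good movie", "good film"]

# norm=None disables the default L2 normalization of each row.
vect = TfidfVectorizer(ngram_range=(1, 1), norm=None)
X = vect.fit_transform(docs)

# Column index of the term "good" in the tf-idf matrix.
col = vect.vocabulary_["good"]
score_doc0 = X[0, col]
score_doc1 = X[1, col]
print(score_doc0, score_doc1)
```

The same term gets a different score in each document, so any "top terms by tf-idf" ranking first needs a decision about how to combine the per-document values.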