I have a collection of 5000 reviews stored in sample_data. I am applying a tf-idf vectorizer to sample_data with a unigram range. Now I want to get the top 1000 words from sample_data that have the highest tf-idf values. Could anyone tell me how to get the top words?
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)
TF-IDF values depend on individual documents. You can get the top 1000 terms based on their count (tf) across the corpus by using the max_features parameter of TfidfVectorizer:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
Just do:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1), max_features=1000)
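As a rough illustration of what max_features does, here is a small sketch on a toy corpus. The docs list is a made-up stand-in for sample_data, and max_features=2 plays the role of 1000 so the result is easy to inspect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for sample_data.
docs = [
    "good product good price",
    "bad product",
    "good service",
]

# Keep only the 2 most frequent terms across the corpus (1000 in the real case).
vect = TfidfVectorizer(ngram_range=(1, 1), max_features=2)
matrix = vect.fit_transform(docs)

# vocabulary_ maps each kept term to its column index in the matrix.
kept_terms = sorted(vect.vocabulary_)
print(kept_terms)
print(matrix.shape)
```

Here "good" (3 occurrences) and "product" (2 occurrences) survive, and the matrix has one column per kept term. Note the selection is by raw count, not by tf-idf score.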
You can also get the idf values (global term weights) from tf_idf_vect after fitting it on the documents, by using the idf_ attribute:
idf_ : array, shape = [n_features], or None
The learned idf vector (global term weights) when use_idf is set to True,
Do this after calling tf_idf_vect.fit(sample_data):
idf = tf_idf_vect.idf_
Then select the 1000 terms you want based on these idf values and re-fit the vectorizer restricted to those terms (for example, by passing them through the vocabulary parameter).
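A minimal sketch of that select-and-re-fit step might look like the following. It assumes "top" means highest idf (that is, the rarest terms); if you want the most common terms instead, sort in the other direction. The docs list and top_n=2 are made-up stand-ins for sample_data and 1000.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for sample_data.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple",
    "apple banana cherry date",
]
top_n = 2  # would be 1000 for the real data

vect = TfidfVectorizer(ngram_range=(1, 1))
vect.fit(docs)

# Terms ordered by column index, so terms[i] lines up with vect.idf_[i].
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))

# Indices of terms sorted by idf, highest (rarest) first.
order = np.argsort(vect.idf_)[::-1]
selected = terms[order[:top_n]].tolist()

# Re-fit a vectorizer restricted to the selected terms.
reduced = TfidfVectorizer(ngram_range=(1, 1), vocabulary=selected)
reduced_matrix = reduced.fit_transform(docs)
print(selected, reduced_matrix.shape)
```

On real data many terms share the same idf (for example, every word that occurs in exactly one review), so the cut-off at 1000 will be somewhat arbitrary among tied terms.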
But you cannot get the top 1000 by "tf-idf", because a tf-idf value is the product of the tf of a term in a single document and the idf (global) of that term. So a word that appears twice in one document will have twice the tf-idf (before normalization) of the same word in another document where it appears only once. How would you compare the different values of the same term? Hope this makes it clear.
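To see this concretely, here is a small sketch with a made-up two-document corpus. It passes norm=None to switch off the default L2 normalization, purely so the raw tf * idf product is visible; with the default normalization the ratio is no longer exactly 2, but the per-document values still differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "good" appears twice in the first document, once in the second.
docs = ["good good movie", "good plot"]

# norm=None keeps the raw tf * idf values (an assumption made for clarity).
vect = TfidfVectorizer(ngram_range=(1, 1), norm=None)
matrix = vect.fit_transform(docs)

col = vect.vocabulary_["good"]   # column index of the term "good"
score_doc0 = matrix[0, col]      # tf-idf of "good" in the first document
score_doc1 = matrix[1, col]      # tf-idf of "good" in the second document
print(score_doc0, score_doc1)
```

The same term gets a different score in each document, so there is no single tf-idf value per word to rank by.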