python scikit-learn nlp tf-idf tfidfvectorizer

Is there a way to get only the IDF values of words using scikit-learn or any other Python package?

I have a text column in my dataset, and I want to compute the IDF of every word that appears in it. The TF-IDF implementations in scikit-learn, such as TfidfVectorizer, give me TF-IDF values directly rather than just the word IDFs. Is there a way to get the word IDFs given a set of documents?

Solution

You can use TfidfVectorizer with use_idf=True (the default), fit it on your documents, and then read the IDF values from its idf_ attribute.

from sklearn.feature_extraction.text import TfidfVectorizer

my_data = ["hello how are you", "hello who are you", "i am not you"]

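tf = TfidfVectorizer(use_idf=True)
# Fitting is enough to learn the IDF weights; the transformed matrix is not needed here
tf.fit(my_data)

# One IDF value per vocabulary term, ordered by feature index
idf = tf.idf_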

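[BONUS] To get the IDF value for a particular word, look up its feature index in vocabulary_. The word must be in the fitted vocabulary (terms are lowercased by default); otherwise the lookup raises a KeyError.

# IDF value for the word "hello"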
tf.idf_[tf.vocabulary_["hello"]]
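
If you would rather not depend on scikit-learn at all, the same numbers can be computed by hand. With default settings (smooth_idf=True), scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. A rough sketch using only the standard library follows; the whitespace tokenization is a simplifying assumption and is not identical to scikit-learn's default tokenizer (for example, it keeps one-letter words such as "i"):

```python
import math
from collections import Counter

docs = ["hello how are you", "hello who are you", "i am not you"]

# Assumption: lowercase + whitespace split; each document contributes a term at most once
tokenized = [set(doc.lower().split()) for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents in which each term occurs
doc_freq = Counter(term for terms in tokenized for term in terms)

# Smoothed IDF, matching scikit-learn's default formula
manual_idf = {
    term: math.log((1 + n_docs) / (1 + count)) + 1
    for term, count in doc_freq.items()
}

print(manual_idf["hello"])
```

For words that both tokenizers keep, such as "hello", this should agree with tf.idf_[tf.vocabulary_["hello"]] up to floating-point error.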