python, scikit-learn, nlp, tf-idf, tfidfvectorizer

Is there a way to get only the IDF values of words using scikit or any other python package?


I have a text column in my dataset, and I want to calculate the IDF for every word that appears in that column. The TF-IDF implementations in scikit-learn, such as TfidfVectorizer, give me TF-IDF values directly rather than just the per-word IDFs. Is there a way to get word IDFs given a set of documents?


Solution

  • You can use TfidfVectorizer with use_idf=True (the default), fit it on your documents, and then read the IDF values from its idf_ attribute.

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    my_data = ["hello how are you", "hello who are you", "i am not you"]
    
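    # Fitting is enough to learn the IDF weights; no transform is needed.
    tf = TfidfVectorizer(use_idf=True)
    tf.fit(my_data)
    
    # One IDF value per vocabulary term, ordered by the term's column index.
    idf = tf.idf_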
    

    [BONUS] To get the IDF value for a particular word:

    # vocabulary_ maps each word to its column index; use it to index into idf_.
    # Here the word is "hello".
    tf.idf_[tf.vocabulary_["hello"]]
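
Since the question asks for the IDF of every word, the per-word lookup above can be extended to the whole vocabulary. The following is a minimal sketch, assuming scikit-learn's default settings (smooth_idf=True and the default tokenizer); the name idf_by_word is only illustrative. With those defaults the IDF should be ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word, and single-character tokens such as "i" are dropped from the vocabulary.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

my_data = ["hello how are you", "hello who are you", "i am not you"]

# Fit with default settings (use_idf=True, smooth_idf=True).
tf = TfidfVectorizer()
tf.fit(my_data)

# Build a word -> IDF mapping from the fitted vocabulary.
# idf_by_word is a hypothetical name chosen for this example.
idf_by_word = {word: float(tf.idf_[idx]) for word, idx in tf.vocabulary_.items()}

# Cross-check one word against the smoothed IDF formula.
n_docs = len(my_data)
df_hello = sum("hello" in doc.split() for doc in my_data)
expected_hello = math.log((1 + n_docs) / (1 + df_hello)) + 1

for word, value in sorted(idf_by_word.items()):
    print(f"{word}: {value:.4f}")
```

If the text lives in a DataFrame column, passing that column to fit in place of my_data should work the same way, since fit accepts any iterable of strings.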