python tf-idf

Reversed TF-IDF in Python


Can I reverse or offset the TF-IDF score such that MORE COMMON terms will contribute more to the final score?

I would like to find the words that are most common across the corpus, i.e. words that aren't unique to any small subset of documents.


Solution

  • I know this is a very old post, but none of the suggestions in the comment section works well.

    The "1/TF-IDF" one only gives you words that are rare throughout documents.

    Remember that TF-IDF down-weights not only prevalent words (through a low idf) but also rare words (through a low tf).

    I recently achieved your goal by using the "tf" and "idf" statistics with the following steps:

    1. Within each document, create a new statistic (Rev_tf_idf) by dividing tf by idf. (The original TF-IDF is tf * idf.)
    2. Group by word and sum Rev_tf_idf within each group.

    I found that words with higher Rev_tf_idf are those that are prevalent throughout all documents in my own data.

    Hope this works for anyone with the same question.
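
The two steps above can be sketched in plain Python roughly as follows. This is only an illustration under some assumptions that are not in the answer: tf is taken as the word count divided by the document length, idf uses the smoothed form ln((1 + N) / (1 + df)) + 1 (so it is never zero and the division is safe even for words that occur in every document), and tokenization is a simple lowercase whitespace split. The function name `reversed_tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter, defaultdict


def reversed_tf_idf(docs):
    """Return {word: sum over documents of tf / idf} (hypothetical helper)."""
    n_docs = len(docs)
    # Assumption: naive tokenization by lowercasing and splitting on whitespace.
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents that contain each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Assumption: smoothed idf, always >= 1, so tf / idf never divides by zero.
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}

    scores = defaultdict(float)
    for tokens in tokenized:
        if not tokens:
            continue  # skip empty documents
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Step 1: Rev_tf_idf = tf / idf within this document.
            # Step 2: sum Rev_tf_idf per word across documents.
            scores[word] += tf / idf[word]
    return dict(scores)


# Made-up sample corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
scores = reversed_tf_idf(docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

On this toy corpus the word that appears in every document ends up with the highest score, which matches the behaviour described in the answer. With an unsmoothed idf such as ln(N / df), a word present in every document would have idf = 0 and the division would fail, so some form of smoothing or offset is needed.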