Tags: information-retrieval, tf-idf, lda, topic-modeling, top-n

How to find Top n topics for a document


I am using tf-idf to rank terms in a document. When the terms are arranged in descending order of tf-idf, the top 'n' terms are the most relevant to that document. When we choose a document, the top 'n' terms of that document have to be displayed. My question is: how do I decide the value of 'n'? For example, for a document the terms arranged in descending order of tf-idf are as follows:

Document 1

  1. president
  2. Obama
  3. Barak
  4. speech
  5. inauguration
  6. come
  7. the
  8. look
  9. again
  10. took

Now when I want to show topics for document 1, I need only the top 5 terms, since the others are not relevant and are not topics of the document. How do I decide this breaking point for the terms in a document? Thanks in advance.


Solution

  • In relation to your sample data, there seems to be a problem: terms 6 to 10 are non-informative, and some of them are even stop-words, such as 'the'.

    So, a first step that you should try is to remove stop-words (a minimal sketch of this preprocessing is given at the end of this answer).

    Coming back to your question, there is no universal best practice for choosing the value of K (your 'n') in top-K keyword extraction. It varies from one document to another, because some documents are more informative (often multi-topical) than others, and such documents should have a higher value of K.

    A way to decide on a stopping point is to check the relative difference between the tf-idf values of consecutive terms and stop at the point where this relative difference exceeds a threshold, which indicates a big fall in the amount of key information that you are outputting.

    Assuming that you have computed a tf-idf score for each term and sorted the terms in descending order of their values, compute the following before adding each new term t_{k+1} (t_k being the last term already added):

    (tfidf(t_k) - tfidf(t_{k+1})) / tfidf(t_k) < delta

    If the above expression is true, where delta is a pre-defined threshold, add the new term, because its informativeness is close enough to that of the terms already in the list. If the expression is false, i.e. the relative difference is higher than delta, stop (a sketch of this cutoff is also given at the end of this answer).

    A note: you can play around with different term scoring functions, not just tf-idf.
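
    Below is a minimal sketch of the stop-word removal and tf-idf ranking step, assuming Python with scikit-learn; the question does not say which language or toolkit is used, so the library choice and the placeholder documents are assumptions:

    ```python
    # Minimal sketch: drop stop-words and rank a document's terms by tf-idf.
    # Assumes scikit-learn; the documents below are placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "President Barack Obama's inauguration speech ...",  # document 1 (placeholder)
        "some other document in the collection ...",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")  # drops common words such as 'the'
    tfidf = vectorizer.fit_transform(docs)               # rows = documents, cols = terms
    terms = vectorizer.get_feature_names_out()

    # Terms of document 1, sorted in descending order of tf-idf
    doc_id = 0
    scores = tfidf[doc_id].toarray().ravel()
    ranked = sorted(zip(terms, scores), key=lambda x: x[1], reverse=True)
    ```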
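
    And a sketch of the relative-difference cutoff described above, in plain Python; the default value of delta is only a guess and should be tuned on your own data:

    ```python
    # Minimal sketch of the stopping criterion: keep adding terms while the
    # relative drop in tf-idf between consecutive terms stays below delta.
    def top_terms(ranked, delta=0.5):
        """ranked: list of (term, score) pairs sorted by descending score."""
        if not ranked:
            return []
        selected = [ranked[0]]
        for term, score in ranked[1:]:
            prev = selected[-1][1]
            if prev == 0:
                break
            rel_diff = (prev - score) / prev    # relative difference between consecutive terms
            if rel_diff < delta:
                selected.append((term, score))  # informativeness is close enough: keep the term
            else:
                break                           # big fall in informativeness: stop here
        return selected

    # e.g. top_terms(ranked, delta=0.5) might keep 'president' ... 'inauguration'
    # and stop at the first large drop in tf-idf.
    ```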