java nlp opennlp tf-idf

Idf score for an unknown word?


My task is to extract keywords from a text. Here is what I did:

I'm using the tf-idf "algorithm". For the idf part, I crawl Wikipedia articles, extract the noun phrases (with OpenNLP), and store them in a database.

So when I analyze a text, I only have to calculate the tf part and look up the idf part in the database.
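A minimal sketch of that scoring step, under some assumptions: the class and method names are hypothetical, an in-memory `Map` stands in for the database, and tf is taken as the phrase count divided by the total number of phrases (the post does not say which tf variant is used). Phrases with no stored idf are simply skipped here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the original post.
class TfIdfSketch {

    // Term frequency: occurrences of each phrase divided by the total phrase count.
    static Map<String, Double> termFrequencies(List<String> phrases) {
        Map<String, Double> tf = new HashMap<>();
        for (String phrase : phrases) {
            tf.merge(phrase, 1.0, Double::sum);
        }
        int total = phrases.size();
        tf.replaceAll((phrase, count) -> count / total);
        return tf;
    }

    // Multiplies each tf by the stored idf. idfTable stands in for the database;
    // phrases that are missing from it get no score in this sketch.
    static Map<String, Double> score(List<String> phrases, Map<String, Double> idfTable) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> entry : termFrequencies(phrases).entrySet()) {
            Double idf = idfTable.get(entry.getKey());
            if (idf != null) {
                scores.put(entry.getKey(), entry.getValue() * idf);
            }
        }
        return scores;
    }
}
```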

The results so far are very appealing. My only problem: since the texts I have to analyze differ from the Wikipedia corpus, some words have a high tf but no idf value (they were not found in the Wikipedia corpus). But sometimes these words are still very important (for example, a new company that is not listed on Wikipedia yet).

What should I use as the idf value if a word wasn't found in the database (corpus)? (The average idf is probably not a good idea.)


Solution

  • How is IDF calculated?

    If you have something like IDF = log_e(# of documents / # of documents with term), you could use log_e((# of documents + 1) / 1), i.e. treat the text being analyzed as a new document in the corpus, and the only one that contains the term.
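The fallback could be sketched in Java roughly as follows. This assumes the database stores a document frequency per term plus the total number of documents; the class, the method names, and the `Map` standing in for the database are all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch; the names and the Map-based "database" are assumptions.
class IdfFallback {

    // Standard form from the answer: ln(# of documents / # of documents with term).
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    // Term not in the corpus: count the analyzed text as one extra document,
    // and as the only document that contains the term.
    static double idfForUnknownTerm(long totalDocs) {
        return Math.log((totalDocs + 1) / 1.0);
    }

    // Uses the stored document frequency when present, else the fallback.
    static double lookupIdf(Map<String, Long> docFrequencies, long totalDocs, String term) {
        Long docsWithTerm = docFrequencies.get(term);
        return docsWithTerm == null
                ? idfForUnknownTerm(totalDocs)
                : idf(totalDocs, docsWithTerm);
    }
}
```

With this choice an unknown term gets a slightly higher idf than a term that appears in exactly one corpus document, so a high-tf word such as a new company name ranks near the top instead of being dropped.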