My task is to extract keywords from a text. Here is what I have done so far:
I'm using the tf-idf "algorithm". For the idf part, I crawl Wikipedia articles, extract the noun phrases (with OpenNLP), and store them in a database.
So when I analyze a text, I only have to calculate the tf part and look up the idf part in the database.
The results so far are very promising. My only problem: since the texts I have to analyze differ from the Wikipedia corpus, some words have a high tf but no idf value (they were not found in the wiki corpus). But sometimes these words are still very important (for example, a new company that is not listed on Wikipedia yet).
What should I use as the idf value if a term wasn't found in the db (corpus)? (The average idf is probably not such a good idea.)
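To make the setup concrete, here is a minimal sketch of the scoring step, assuming the database is represented by a plain dictionary. The name idf_from_corpus, the idf numbers, and the phrase "acme robotics" are made up for illustration; they are not taken from the real corpus.

```python
from collections import Counter

# Hypothetical stand-in for the database of idf values built from Wikipedia.
idf_from_corpus = {"company": 2.3, "market": 3.1}


def score_terms(noun_phrases, idf_table):
    """Return {term: tf * idf}, with None where the corpus has no idf for the term."""
    counts = Counter(noun_phrases)
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        tf = count / total          # relative frequency in the analyzed text
        idf = idf_table.get(term)   # None if the term never appeared in the corpus
        scores[term] = tf * idf if idf is not None else None
    return scores


# "acme robotics" has the highest tf but no idf, so it cannot be scored.
phrases = ["company", "acme robotics", "acme robotics", "market"]
scores = score_terms(phrases, idf_from_corpus)
```

The frequent but unknown phrase ends up without a score, which is exactly the case I'm asking about.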
How is the IDF calculated?
If you have something like IDF = log_e(# of documents / # of documents with term),
you could use log_e((# of documents + 1) / 1) for a term that is not in the corpus,
i.e. treat the text being analyzed as a new document in the corpus, and the only one that contains the term.
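A minimal sketch of that fallback, assuming the natural-log formula above and that the database can report the total number of documents and the number containing a term. The function name idf and its parameters are illustrative, not part of any library.

```python
import math


def idf(total_docs, docs_with_term):
    """Natural-log idf with a fallback for terms the corpus has never seen."""
    if docs_with_term > 0:
        return math.log(total_docs / docs_with_term)
    # Unseen term: count the analyzed text as one extra document,
    # and as the only document that contains the term.
    return math.log((total_docs + 1) / 1)


known = idf(1000, 10)    # term found in 10 of 1000 corpus documents
unseen = idf(1000, 0)    # term missing from the corpus
```

With this choice an unseen term always gets a slightly higher idf than a term that occurs in exactly one corpus document, since log(N + 1) > log(N). So a new company name with a high tf would rank near the top rather than being dropped.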