lucene, weka, stop-words, snowball

Strategy for removing common English words


I want to extract relevant keywords from an HTML page.

I have already stripped all the HTML markup, split the text into words, applied a stemmer, and removed every word that appears in a stop word list from Lucene.

But I still end up with a lot of basic verbs and pronouns as the most common words.

Is there some method or word list in Lucene, Snowball, or anywhere else to filter out words like "I, is, go, went, am, it, were, we, you, us, ..."?


Solution

  • This seems like a fairly simple application of inverse document frequency. If you had even a small corpus of, say, 10,000 web pages, you could compute the probability of each word appearing in a document (its document frequency). Then pick a threshold below which the words start to get interesting or contentful, and exclude every word whose document frequency is above that threshold.

    Alternatively, this stop word list looks good: http://www.lextek.com/manuals/onix/stopwords1.html
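
    The document-frequency idea above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not Lucene or Snowball code: the function names, the toy corpus, and the `max_df` cutoff of 0.5 are all assumptions made for the example, and a real threshold would have to be tuned on your own corpus.

```python
from collections import Counter


def document_frequencies(docs):
    """Return, for each word, the fraction of documents it appears in."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word at most once per document
    n = len(docs)
    return {word: count / n for word, count in df.items()}


def filter_common_words(docs, max_df=0.5):
    """Drop every word that appears in more than max_df of the documents."""
    df = document_frequencies(docs)
    return [[w for w in words if df[w] <= max_df] for words in docs]


# Toy corpus of already tokenized documents (hypothetical data).
corpus = [
    ["i", "went", "to", "the", "lucene", "meetup"],
    ["we", "went", "to", "index", "the", "pages"],
    ["i", "use", "the", "snowball", "stemmer"],
    ["the", "stemmer", "went", "wrong"],
]

filtered = filter_common_words(corpus, max_df=0.5)
print(filtered[0])
```

    With a corpus this small, only "the" and "went" exceed the cutoff. On a larger corpus the same computation would be expected to push pronouns and basic verbs above the threshold as well, since they occur in nearly every document.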