java lucene stemming

Difference between Lucene stemmers: EnglishStemmer, PorterStemmer, LovinsStemmer


Has anybody compared these stemmers from Lucene (package org.tartarus.snowball.ext): EnglishStemmer, PorterStemmer, LovinsStemmer? What are the strong and weak points of the algorithms behind them? When should each of them be used? Or are there other algorithms available for stemming English words?

Thanks.


Solution

  • The Lovins stemmer is a very old algorithm that is not of much practical use, since the Porter stemmer is much stronger. Based on some quick skimming of the source code, it seems PorterStemmer implements Porter's original (1980) algorithm, while EnglishStemmer implements his updated version (often called Porter2), which should be better.

    A stronger alternative to stemming (actually a lemmatizer) is available in the Stanford NLP tools. A Lucene-Stanford NLP bridge by yours truly is available here (API docs).

    See also Manning, Raghavan & Schütze for general info about stemming and lemmatization.
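
    If you want to see the differences for yourself, a small harness along these lines may help. It is only a sketch: it assumes the three classes expose the usual Snowball methods setCurrent(String), stem() and getCurrent(), and it loads them reflectively so that it still compiles and runs when Lucene's analysis jar is not on the classpath or when a given class is missing from your Lucene version. The class name StemmerComparison, the helper stemWith and the sample words are made up for illustration.

    ```java
    public class StemmerComparison {
        // Package named in the question; the stemmers live here in Lucene.
        private static final String PACKAGE = "org.tartarus.snowball.ext.";

        /**
         * Stems one word with the named stemmer class.
         * Returns null if the class (or one of the assumed methods) is unavailable.
         */
        static String stemWith(String stemmerClass, String word) {
            try {
                Class<?> cls = Class.forName(PACKAGE + stemmerClass);
                Object stemmer = cls.getDeclaredConstructor().newInstance();
                // Assumed Snowball API: setCurrent(String), stem(), getCurrent().
                cls.getMethod("setCurrent", String.class).invoke(stemmer, word);
                cls.getMethod("stem").invoke(stemmer);
                return (String) cls.getMethod("getCurrent").invoke(stemmer);
            } catch (ReflectiveOperationException e) {
                return null; // stemmer not on the classpath, or API differs
            }
        }

        public static void main(String[] args) {
            String[] stemmers = {"EnglishStemmer", "PorterStemmer", "LovinsStemmer"};
            // Snowball stemmers expect lowercase input.
            String[] words = {"running", "generously", "relational", "skies"};
            for (String word : words) {
                StringBuilder row = new StringBuilder(word);
                for (String name : stemmers) {
                    String stem = stemWith(name, word);
                    row.append("  ").append(name).append('=')
                       .append(stem == null ? "(unavailable)" : stem);
                }
                System.out.println(row);
            }
        }
    }
    ```

    Run it with the Lucene analysis jar on the classpath and feed it words from your own corpus; the cases where the three outputs disagree are the ones worth looking at before you pick a stemmer.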