linguistics nltk

Which word stemmer should I use in nltk?


My goal is to analyze a corpus (Twitter, for now) for emotional content. Just today I realized it would make some sense to search for word stems rather than maintain an exhaustive list of emotional words. So I've been exploring nltk.stem, only to find that there are five different stemmers. I'd like to ask the Stack Overflow linguists whether LancasterStemmer, PorterStemmer, RegexpStemmer, RSLPStemmer, or WordNetStemmer is best, preferably with some justification.


Solution

  • RSLP is for Portuguese, and I'm guessing you want English. RegexpStemmer would require you to write your own stemming expressions, so I think it can be ignored as well. The WordNetStemmer (exposed as WordNetLemmatizer in current NLTK releases) needs the part of speech of each word to work well, so you'd have to do POS tagging first in order to use it. I've used the Porter stemming algorithm and it's pretty good. The Lancaster algorithm is newer and stems more aggressively, so it might be better, though it can also over-stem. You might want to try a combination of stemmers, choosing the shortest stem that any of them produces. Anyway, the bottom line is that PorterStemmer is a good default choice.
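
The "shortest stem" idea from the answer above can be sketched roughly as follows. This is only an illustration, not a recommended pipeline: it assumes NLTK is installed, uses only PorterStemmer and LancasterStemmer (neither needs extra corpus downloads), and the helper names `candidate_stems` and `shortest_stem` are made up for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

# Both stemmers are rule-based and work without downloading NLTK data.
porter = PorterStemmer()
lancaster = LancasterStemmer()


def candidate_stems(word):
    """Return the stem each stemmer produces for `word` (hypothetical helper)."""
    return {
        "porter": porter.stem(word),
        "lancaster": lancaster.stem(word),
    }


def shortest_stem(word):
    """Pick the shortest candidate stem (hypothetical helper).

    On a tie in length, min() keeps the first candidate, which is
    Porter's stem because of the dict insertion order above.
    """
    return min(candidate_stems(word).values(), key=len)


if __name__ == "__main__":
    # Print the candidates side by side so the stemmers can be compared.
    for word in ["running", "happiness", "angrily", "maximum"]:
        print(word, candidate_stems(word), "->", shortest_stem(word))
```

Because Lancaster tends to strip more than Porter, the shortest stem will often be Lancaster's. Whether that helps or hurts when matching emotional vocabulary is worth checking against a sample of your own corpus before committing to it.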