Tags: r, text-mining, stemming

Word stemming in R


I am working on a text mining project and trying to clean the text: words in singular and plural forms, verbs in different tenses, and misspelled words. My sample looks like this:

test <- c("apple","apples","wife","wives","win","won","winning","winner","orange","oranges","orenge")

I tried to use the wordStem function in the SnowballC package. However, the results are not what I want:

"appl"   "appl"   "wife"   "wive"   "win"    "won"    "win"    "winner" "orang"  "orang"  "oreng" 

What I would like to see is:

"apple"   "apple"   "wife"   "wife"   "win"    "win"    "win"    "winner" "orange"  "orange"  "orange"

Solution

  • That is just how the Porter stemmer works. It relies on fairly simple suffix rules so that it can create stems without storing a large English vocabulary. For example, you would probably not like that both change and changing stem to chang; it seems more natural for both to stem to change. So would you add a rule that, after taking ing off the end of a word, you put back an e to get the stem? Then what happens with clang and clanging? The Porter stemmer gives clang, while adding e would give the non-word clange. Either you use simple processing rules that sometimes produce stems that are not words, or you include a large vocabulary and more complex rules that depend on what the words are. The Porter stemmer takes the simple-rules approach.
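
The trade-off described above can be illustrated with a small base R sketch. Everything in it is hypothetical and written only for illustration: `strip_ing`, `strip_ing_add_e`, `lemmatize`, and the hand-made `lemma_lookup` table are not part of SnowballC, and the two suffix rules are toy stand-ins rather than the actual Porter rules. The first part shows why "add back an e" breaks on clanging; the second shows the vocabulary-based alternative on the sample from the question, assuming you are willing to maintain the lookup table yourself.

```r
# --- Part 1: simple suffix rules (toy rules, not the real Porter stemmer) ---

# Rule A: strip a trailing "ing".
strip_ing <- function(words) sub("ing$", "", words)

# Rule B: strip a trailing "ing", then add "e" back to the words that lost it.
strip_ing_add_e <- function(words) {
  had_ing <- grepl("ing$", words)
  stems <- sub("ing$", "", words)
  stems[had_ing] <- paste0(stems[had_ing], "e")
  stems
}

words <- c("change", "changing", "clang", "clanging")
rule_a <- strip_ing(words)        # "change" "chang"  "clang" "clang"
rule_b <- strip_ing_add_e(words)  # "change" "change" "clang" "clange"

# --- Part 2: vocabulary lookup (hand-made table, an assumption) ---

# Each name is a surface form, each value the form it should map to.
# A real table would need to cover a large vocabulary; the "orenge"
# entry is a manually listed misspelling, not automatic spell checking.
lemma_lookup <- c(
  apples = "apple", wives = "wife", won = "win",
  winning = "win", oranges = "orange", orenge = "orange"
)

# Replace every word found in the table; leave all other words unchanged.
lemmatize <- function(words, lookup) {
  hit <- words %in% names(lookup)
  words[hit] <- unname(lookup[words[hit]])
  words
}

test <- c("apple", "apples", "wife", "wives", "win", "won",
          "winning", "winner", "orange", "oranges", "orenge")
cleaned <- lemmatize(test, lemma_lookup)
print(cleaned)
```

With this table, `cleaned` matches the output requested in the question, but only because every irregular form was listed by hand. That is the cost of the vocabulary approach, while Rule B shows the cost of the rule approach: it repairs changing and invents clange.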