I am working on a text mining project and trying to clean the text: words in singular and plural forms, verbs in different tenses, and misspelled words. My sample looks like this:
test <- c("apple","apples","wife","wives","win","won","winning","winner","orange","oranges","orenge")
I tried the wordStem function from the SnowballC package, but the results are not what I want:
"appl" "appl" "wife" "wive" "win" "won" "win" "winner" "orang" "orang" "oreng"
What I would like to see is:
"apple" "apple" "wife" "wife" "win" "win" "win" "winner" "orange" "orange" "orange"
That is just how the Porter Stemmer works. It uses fairly simple rules to create the stems, so it does not have to store a large English vocabulary. For example, I think you would not like that both "change" and "changing" go to "chang". It seems more natural that they should both stem to "change". So would you make a rule that if you take "ing" off the end of a word, you should add back "e" to get the stem? Then what would happen with "clang" and "clanging"? The Porter Stemmer gives "clang". Adding "e" would give the non-word "clange". Either you use simple processing rules that sometimes create stems that are not words, or you must include a large vocabulary and have more complex rules that depend on what the words are. The Porter Stemmer uses the simple rules method.
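To make the trade-off concrete, here is a minimal base R sketch of both options. The names naive_ing_rule, toy_lemmas and lookup_lemma are hypothetical, invented for this illustration. They are not part of SnowballC, and the lookup table is a hand-written stand-in for a real vocabulary that happens to cover only the sample words.

```r
# Option 1: a simple rule with no vocabulary.
# Hypothetical rule: strip a trailing "ing" and add back "e".
naive_ing_rule <- function(words) sub("ing$", "e", words)

naive_ing_rule(c("changing", "clanging"))
# "change" "clange"   <- the second result is not a word

# Option 2: a vocabulary lookup.
# Toy table (illustrative entries only); a real one would need to be far larger.
toy_lemmas <- c(apples = "apple", wives = "wife", won = "win",
                winning = "win", oranges = "orange", orenge = "orange")

lookup_lemma <- function(words) {
  hit <- words %in% names(toy_lemmas)               # words the table knows
  words[hit] <- unname(toy_lemmas[words[hit]])      # replace them with their base form
  words                                             # unknown words pass through unchanged
}

test <- c("apple","apples","wife","wives","win","won","winning","winner","orange","oranges","orenge")
lookup_lemma(test)
# "apple" "apple" "wife" "wife" "win" "win" "win" "winner" "orange" "orange" "orange"
```

The lookup gives the output you asked for only because every irregular form and the misspelling were entered by hand, which is exactly the cost of the vocabulary approach.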