I am having a problem getting the right text after stemming in R. For example, 'papper' should stay 'papper' but is shown as 'papp', and 'projekt' becomes 'projek'.
The word-frequency cloud generated from the stemmed text shows these shortened forms, which lose their meaning or become incomprehensible.
What can I do to fix this? I am using the latest version of SnowballC (0.6.0).
R Code:
library(tm)
library(SnowballC)
text_example <- c("projekt", "papper", "arbete")
stem_doc <- stemDocument(text_example, language="sv")
stem_doc
Expected:
stem_doc
[1] "projekt" "papper" "arbete"
Actual:
stem_doc
[1] "projek" "papp" "arbet"
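What you want here is not stemming but lemmatization (see @Newl's link for the difference). A stemmer strips suffixes by rule, so its output does not have to be a real word, whereas a lemmatizer returns the dictionary form of each word.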
To get the correct lemmas, you can use the R package udpipe, which is a wrapper around the UDPipe C++ library.
Here is a quick example of how you would do what you want:
# install.packages("udpipe")
library(udpipe)
dl <- udpipe_download_model(language = "swedish-lines")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.3/master/inst/udpipe-ud-2.3-181115/swedish-lines-ud-2.3-181115.udpipe to C:/Users/Johannes Gruber/AppData/Local/Temp/RtmpMhaF8L/reprex8e40d80ef3/swedish-lines-ud-2.3-181115.udpipe
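# Load the downloaded model from disk
udmodel_swed <- udpipe_load_model(file = dl$file_model)
text_example <- c("projekt", "papper", "arbete")
# Annotate the text; the result includes a lemma for each token
x <- udpipe_annotate(udmodel_swed, x = text_example)
# Convert the annotation to a data frame with one row per token
x <- as.data.frame(x)
x$lemma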
#> [1] "projekt" "papper" "arbete"
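Since the end goal is a frequency cloud, the lemmas can be counted before plotting. Below is a minimal sketch in base R. The vector `lemmas` is a hypothetical stand-in for `x$lemma` from a larger annotated corpus (the repeated words are made up for illustration), and `freq_df` is just an illustrative name:

```r
# Hypothetical lemma vector; in practice this would be x$lemma
lemmas <- c("projekt", "papper", "arbete", "projekt", "papper", "projekt")

# Count each lemma and sort from most to least frequent
lemma_freq <- sort(table(lemmas), decreasing = TRUE)

# Word/frequency data frame, a common input shape for word cloud functions
freq_df <- data.frame(
  word = names(lemma_freq),
  freq = as.integer(lemma_freq),
  stringsAsFactors = FALSE
)
freq_df
```

The `word` and `freq` columns can then be passed to whichever word cloud function you already use, so the cloud shows full words instead of truncated stems.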