Not getting the right text after stemming in text analysis (Swedish)


I am having trouble getting the right text after stemming in R. For example, 'papper' should stay 'papper' but comes out as 'papp', and 'projekt' becomes 'projek'.

The frequency cloud generated from the stemmed text shows these shortened forms, which lose the actual meaning of the words or are incomprehensible.

What can I do to fix this? I am using the latest version of SnowballC (0.6.0).

R Code:

library(tm)
library(SnowballC)
text_example <- c("projekt", "papper", "arbete")
stem_doc <- stemDocument(text_example, language="sv")
stem_doc

Expected:
stem_doc
[1] "projekt" "papper"  "arbete"

Actual:
stem_doc
[1] "projek" "papp"   "arbet"

Solution

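  • What you describe here is not stemming but lemmatization (see @Newl's link for the difference). A stemmer only strips suffixes by rule, which is why it returns truncated forms such as 'papp', whereas a lemmatizer returns the dictionary form of each word.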

    To get the correct lemmas, you can use the R package udpipe, which is a wrapper around the UDPipe C++ library.

    Here is a quick example of how you would do what you want:

    # install.packages("udpipe")    
    library(udpipe)
    dl <- udpipe_download_model(language = "swedish-lines")
    #> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.3/master/inst/udpipe-ud-2.3-181115/swedish-lines-ud-2.3-181115.udpipe to C:/Users/Johannes Gruber/AppData/Local/Temp/RtmpMhaF8L/reprex8e40d80ef3/swedish-lines-ud-2.3-181115.udpipe
    
    udmodel_swed <- udpipe_load_model(file = dl$file_model)
    
    text_example <- c("projekt", "papper", "arbete")
    
    x <- udpipe_annotate(udmodel_swed, x = text_example)
    x <- as.data.frame(x)
    x$lemma
    #> [1] "projekt" "papper"  "arbete"
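
Since the goal in the question is a frequency cloud, the lemmas can then be counted instead of the stems. Here is a minimal sketch in base R, assuming a small hard-coded vector (`lemmas`, hypothetical sample data) in place of `x$lemma` so that it runs without downloading the model:

```r
# Stand-in for x$lemma from the udpipe annotation above
# (hypothetical sample data, not real annotation output).
lemmas <- c("projekt", "papper", "arbete", "papper", "projekt", "papper")

# Count how often each lemma occurs, most frequent first.
freq <- sort(table(lemmas), decreasing = TRUE)

# One row per lemma: the word and its count.
freq_df <- data.frame(
  word = names(freq),
  freq = as.integer(freq),
  stringsAsFactors = FALSE
)
freq_df
```

The `word` and `freq` columns should be usable as input for a word cloud function, for example the `words` and `freq` arguments of `wordcloud::wordcloud()`, if that is the package you are plotting with.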