
Problems with stemming in text analysis (Swedish data)


In the following code, my aim is to reduce words that share a stem to a single form. For example, kompis in Swedish means "friend" in English, and words with the same root include kompisar and kompiserna.

rm(list=ls())
Sys.setlocale("LC_ALL","sv_SE.UTF-8")
library(tm)
library(SnowballC)
kompis <- c("kompisar", "kompis", "kompiserna")
stem_doc <- stemDocument(kompis, language="swedish")
stem_doc
[1] "kompis" "kompis" "kompis"

I created a sample text containing the words kompis, kompisar, and kompiserna. Then I preprocessed the corpus with the following code:

text <- c("TV och vara med kompisar.",
          "Jobba på kompis huset",
          "Ta det lugnt, umgås med kompisar.",
          "Umgås med kompisar, vänner ",
          "kolla anime med kompiserna")
corpus.prep <- Corpus(VectorSource(text), readerControl = list(reader = readPlain, language = "swe"))
corpus.prep <- tm_map(corpus.prep, PlainTextDocument)
corpus.prep <- tm_map(corpus.prep, stemDocument,language = "swedish")
head(content(corpus.prep[[1]]))

The results are as follows. However, they still contain the original words rather than the common stem, kompis.

[1] "TV och vara med kompisar."
[2] "Jobba på kompi huset"
[3] "Ta det lugnt, umgå med kompisar."
[4] "Umgås med kompisar, vänner"
[5] "kolla anim med kompiserna"

Do you know how to fix it?


Solution

  • You are almost there, but using PlainTextDocument is interfering with your goal.

    The following code will return your expected result. I'm using removePunctuation, because otherwise the stemming will not work on words at the end of a sentence. You will also see warning messages after both tm_map calls. You can ignore these.

    corpus.prep <- Corpus(VectorSource(text), readerControl = list(reader = readPlain, language = "swe"))
    corpus.prep <- tm_map(corpus.prep, removePunctuation)
    corpus.prep <- tm_map(corpus.prep, stemDocument, language = "swedish")
    
    head(content(corpus.prep))
    
    [1] "TV och var med kompis"         "Jobb på kompis huset"          "Ta det lugnt umgås med kompis" "Umgås med kompis vänn"        
    [5] "koll anim med kompis"   
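
    To see why punctuation matters, here is a minimal sketch that calls SnowballC's wordStem() directly on single tokens. It is an illustration, not part of the tm pipeline above: the variable names are made up, and gsub() is only a base-R stand-in for removePunctuation. The last call runs the English stemmer on a few of the same words. The stems in the question's output (kompi, anim) look like English stemming, so one possible explanation, which is an assumption here, is that the language setting was not applied in that pipeline.

```r
library(SnowballC)

# The three forms from the question, plus two with trailing punctuation.
clean      <- c("kompisar", "kompis", "kompiserna")
with_punct <- c("kompisar.", "kompisar,")

# wordStem() treats each element as one word, so a trailing "." or ","
# is part of the string the stemmer sees.
stems_clean <- wordStem(clean, language = "swedish")
stems_punct <- wordStem(with_punct, language = "swedish")
print(stems_clean)
print(stems_punct)

# Strip punctuation first (base-R stand-in for tm's removePunctuation),
# then stem again.
stripped       <- gsub("[[:punct:]]", "", with_punct)
stems_stripped <- wordStem(stripped, language = "swedish")
print(stems_stripped)

# For comparison: the same kind of words through the English stemmer.
stems_english <- wordStem(c("kompis", "kompiserna", "anime"), language = "english")
print(stems_english)
```

    Comparing stems_punct with stems_stripped shows the effect of removing punctuation before stemming, and comparing stems_english with the question's output is a quick way to check which stemmer was actually used.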
    

    For this kind of work I tend to use quanteda. It has better support and works a lot better than tm.

    library(quanteda)
    
    # remove_punct not really needed as quanteda treats the "." as a separate token.
    my_dfm <- dfm(text, remove_punct = TRUE) 
    dfm_wordstem(my_dfm, language = "swedish")
    
    Document-feature matrix of: 5 documents, 15 features (69.3% sparse).
    5 x 15 sparse Matrix of class "dfm"
           features
    docs    tv och var med kompis jobb på huset ta det lugnt umgås vänn koll anim
      text1  1   1   1   1      1    0  0     0  0   0     0     0    0    0    0
      text2  0   0   0   0      1    1  1     1  0   0     0     0    0    0    0
      text3  0   0   0   1      1    0  0     0  1   1     1     1    0    0    0
      text4  0   0   0   1      1    0  0     0  0   0     0     1    1    0    0
      text5  0   0   0   1      1    0  0     0  0   0     0     0    0    1    1
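
    Newer quanteda releases expect dfm() to receive a tokens object rather than a character vector, so the call above may need adjusting depending on the installed version. The following is a sketch of the same idea using an explicit tokens step. The object names (toks, toks_stem, freq) are illustrative, and the exact printed layout of the matrix may differ from the output shown above.

```r
library(quanteda)

text <- c("TV och vara med kompisar.",
          "Jobba på kompis huset",
          "Ta det lugnt, umgås med kompisar.",
          "Umgås med kompisar, vänner ",
          "kolla anime med kompiserna")

# Tokenise, drop punctuation, and lowercase before stemming.
toks <- tokens(text, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Stem the tokens with the Swedish Snowball stemmer.
toks_stem <- tokens_wordstem(toks, language = "swedish")

# Build the document-feature matrix from the stemmed tokens.
my_dfm <- dfm(toks_stem)
print(my_dfm)

# Total count per stem across all documents, most frequent first.
freq <- topfeatures(my_dfm, n = nfeat(my_dfm))
print(freq[["kompis"]])
```

    Stemming at the tokens stage keeps the stemmed words available for other steps (for example n-grams) before the matrix is built, whereas dfm_wordstem() stems the features of an existing matrix.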