
How to use a custom stemming algorithm with the tm package in R?


How can I replace the Porter-based stemmer in the R package tm with one that better suits my needs? In this case it is cistem (https://github.com/FlorianSchwendinger/cistem/). However, cistem takes a single word (or a vector of words) as its argument:

# install_github() is provided by the remotes (or devtools) package
remotes::install_github("FlorianSchwendinger/cistem")
library("cistem")
> cistem("arbeiten")
[1] "arbei"
> cistem(c("arbeiten", "Arbeit"))
[1] "arbei"  "arbeit"

whereas the built-in stemmer takes a whole document:

corpus <- tm_map(corpus, stemDocument, language = "german") 

How do I integrate the CISTEM stemmer within the tm package?

Any help is appreciated.


Solution

  • You can wrap other functions in content_transformer and then use the result in a tm_map call. You just need to know what input the wrapped function expects. In this case cistem needs individual words, so you can use the words function from the NLP package (loaded automatically when you load the tm library) to split each line into words. An lapply over the lines of the document and an unlist to turn the result back into a character vector are also needed.

    Note: cistem returns the words in lowercase, so be aware of this.

    library(cistem)
    library(tm)
    
    # Some text
    txt <- c("Dies ist ein deutscher Text.", 
      "Dies ist ein anderer deutscher Text.")
    
    # the stemmer based on cistem
    my_stemmer <- content_transformer(function(x) {
      unlist(lapply(x, function(line) {    # lapply over the lines of the document, then unlist the result
        paste(cistem(words(line)), collapse = " ")  # stem each word and paste the words back together
      }))
    })
    
    my_corpus <- VCorpus(VectorSource(txt))
    
    # stem the corpus
    my_stemmed_corpus <- tm_map(my_corpus, my_stemmer)
    
    
    # check output
    inspect(my_stemmed_corpus[[1]])
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 26
    
    dies ist ein deutsch text.
    
    inspect(my_stemmed_corpus[[2]])  
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 32
    
    dies ist ein ander deutsch text.
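
The wrapping idea above is not specific to cistem, so it may help to see it as a small base R sketch that runs without tm or cistem installed. The names toy_stem and make_line_stemmer are hypothetical and made up for this illustration. toy_stem is only a stand-in for a real stemmer, and splitting on whitespace is an assumption standing in for words().

```r
# Hypothetical stand-in for cistem(): lowercases and strips a trailing "en".
# It is NOT a real stemmer; it only mimics the "vector of words in,
# vector of stems out" interface.
toy_stem <- function(words) sub("en$", "", tolower(words))

# Turn any vectorised word-level stemmer into a function that works on
# a character vector of lines (what a document's content looks like).
make_line_stemmer <- function(word_stemmer) {
  function(x) {
    vapply(x, function(line) {
      # split the line on whitespace (assumption: stands in for words())
      tokens <- strsplit(line, "[[:space:]]+")[[1]]
      tokens <- tokens[nzchar(tokens)]
      # stem each word and paste the words back together
      paste(word_stemmer(tokens), collapse = " ")
    }, character(1), USE.NAMES = FALSE)
  }
}

stem_lines <- make_line_stemmer(toy_stem)
stem_lines(c("Wir arbeiten heute", "Sie laufen"))
# [1] "wir arbeit heute" "sie lauf"
```

With tm and cistem loaded, something like content_transformer(make_line_stemmer(cistem)) should then be usable in tm_map in the same way as my_stemmer, though that combination is untested here.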