How can I replace the Porter-based stemmer in the R package tm with one that better suits my needs? In this case it's CISTEM (https://github.com/FlorianSchwendinger/cistem/). However, cistem takes a single word (or a vector of words) as its argument:
devtools::install_github("FlorianSchwendinger/cistem")  # needs the devtools package
library("cistem")
> cistem("arbeiten")
[1] "arbei"
> cistem(c("arbeiten", "Arbeit"))
[1] "arbei" "arbeit"
whereas the built-in stemmer takes a whole document:
corpus <- tm_map(corpus, stemDocument, language = "german")
How do I integrate the CISTEM stemmer within the tm package?
Any help is appreciated.
You can integrate other functions with content_transformer, which you can then use in a tm_map call. You just need to know what input the receiving function expects. In this case cistem needs individual words, so you can use the words function from the NLP package (attached automatically when you load tm) to split each line into words. An lapply over the lines and an unlist are also needed.
Note: cistem returns the words in lowercase, so be aware of this.
library(cistem)
library(tm)
# Some text
txt <- c("Dies ist ein deutscher Text.",
"Dies ist ein anderer deutscher Text.")
# the stemmer based on cistem
my_stemmer <- content_transformer(function(x) {
  # x is the document's content (a character vector of lines)
  unlist(lapply(x, function(line) {
    # split the line into words, stem them, and paste them back together
    paste(cistem(words(line)), collapse = " ")
  }))
})
my_corpus <- VCorpus(VectorSource(txt))
# stem the corpus
my_stemmed_corpus <- tm_map(my_corpus, my_stemmer)
# check output
inspect(my_stemmed_corpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 26
dies ist ein deutsch text.
inspect(my_stemmed_corpus[[2]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 32
dies ist ein ander deutsch text.
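If you want to see what the transformer does to each line without installing cistem or building a corpus, here is a minimal base-R sketch of the same split, stem, and paste pattern. Note that toy_stem and stem_lines are hypothetical names made up for this illustration; toy_stem is only a stand-in with the same "character vector in, character vector out" interface as cistem, not the CISTEM algorithm, and strsplit on whitespace only approximates what words does.

```r
# Hypothetical stand-in for cistem(): lowercases each word and strips a
# trailing "en" or "er". This is NOT the CISTEM algorithm.
toy_stem <- function(words) {
  sub("(en|er)$", "", tolower(words))
}

# Split each line on whitespace, stem the words, and paste them back together.
# The stemmer is passed in, so any vectorised word stemmer can be plugged in.
stem_lines <- function(lines, stemmer) {
  vapply(lines, function(line) {
    tokens <- strsplit(line, "[[:space:]]+")[[1]]
    paste(stemmer(tokens), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

stem_lines(c("Wir arbeiten heute", "Ein anderer Text"), toy_stem)
# [1] "wir arbeit heute" "ein ander text"
```

Assuming cistem is installed, passing cistem instead of toy_stem should work the same way, since it also takes and returns a character vector of words.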