
Custom tokenizer in tm package R not working


Please see the MWE below: the custom tokenizer I defined is not being applied when building the document-term matrix. Why? The tm package version is 0.71.

library(tm)

ts <- c("This is a testimonial")
corpDs <- Corpus(VectorSource(ts))

#This is not working
ownTokenizer <- function(x) unlist(strsplit(as.character(x), "i+"))
tdm <- DocumentTermMatrix(corpDs,control=list(tokenize=ownTokenizer))
as.matrix(tdm)

#This is working
ownTokenizer(ts)

Output:

    Terms
Docs testimonial this
   1           1    1

[1] "Th"       "s "       "s a test" "mon"      "al"

Thank you,

Tobias


Solution

  • I know this is somewhat stale now, but maybe it still helps others: you have to replace corpDs <- Corpus(...) with corpDs <- VCorpus(...). As the tm documentation states in the TermDocumentMatrix description, "SimpleCorpus" corpora are always tokenized with a fixed tokenizer, with no customization possible. It seems to be the same for "Corpus".
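
As a minimal sketch of the change suggested above, the MWE from the question can be rewritten with VCorpus. The wordLengths setting is an addition that is not in the original post: termFreq keeps only terms of at least three characters by default, so short tokens such as "Th" would otherwise be dropped even when the custom tokenizer is used.

```r
library(tm)

ts <- c("This is a testimonial")

# Same tokenizer as in the question: split on runs of the letter "i"
ownTokenizer <- function(x) unlist(strsplit(as.character(x), "i+"))

# VCorpus instead of Corpus, so the tokenize control option is honoured
corpDs <- VCorpus(VectorSource(ts))

tdm <- DocumentTermMatrix(
  corpDs,
  control = list(
    tokenize = ownTokenizer,
    wordLengths = c(1, Inf)  # assumption: keep short tokens as well
  )
)

# Terms are lower-cased by default (tolower is on unless disabled)
Terms(tdm)
as.matrix(tdm)
```

To check which kind of corpus a given call produced, class(corpDs) can be inspected before building the matrix.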