
How to implement a backup tokenizer switch in RWeka?


I am using the R tm and RWeka packages to do some text mining. Building a term-document matrix on single words is not enough for my purposes, so I have to extract n-grams. I used @Ben's function to extract trigrams:

    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
    tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))

The output has an apparent error, see below: it picks up 4-, 3-, and 2-word phrases. Ideally, it should have picked up ONLY the 4-word noun phrase and dropped the rest (the 3- and 2-word phrases). How do I force this, the way Python's NLTK has a backup tokenizer option?

abstract strategy -> incorrect
abstract strategy board -> incorrect
abstract strategy board game -> this should be the correct output

accenture executive
accenture executive simple
accenture executive simple comment

Many thanks.


Solution

  • I think you were very close with the attempt that you made, except that you have to understand that what you told Weka to do was to capture both 2-gram and 3-gram tokens; that is simply how you specified Weka_control (min = 2, max = 3).

    Instead, I'd recommend using a different token size in each tokenizer, then selecting or merging the results according to your preference or decision rule.

    I think it would be worth checking out this great tutorial on n-gram wordclouds.

    A solid code snippet for n-gram text mining is:

    # QuadgramTokenizer ####
    QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
    

    for 4-grams,

    # TrigramTokenizer ####
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    

    for 3-grams, and of course

    # BigramTokenizer ####
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    

    for 2-grams.

    You might be able to avoid your earlier problem by running the different gram sizes separately like this, instead of setting Weka_control to a range.
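    Once you have the separate token sets, one simple decision rule is to drop every shorter n-gram that is merely a substring of a longer n-gram you already kept, which approximates the "backup tokenizer" behaviour you describe. A minimal base-R sketch (the helper name drop_contained is hypothetical, not part of RWeka or tm):

```r
# Hypothetical helper: keep a shorter n-gram only if it does NOT occur
# inside any of the longer n-grams already kept. Longer phrases win;
# shorter phrases survive only when they carry new information.
drop_contained <- function(short_grams, long_grams) {
  contained <- vapply(
    short_grams,
    function(g) any(grepl(g, long_grams, fixed = TRUE)),
    logical(1)
  )
  short_grams[!contained]
}

quadgrams <- c("abstract strategy board game")
trigrams  <- c("abstract strategy board", "accenture executive simple")
drop_contained(trigrams, quadgrams)
# "abstract strategy board" is dropped; "accenture executive simple" is kept
```

    You would apply this once per step down (4-grams over 3-grams, then the survivors over 2-grams).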

    You can apply the tokenizer like this:

    tdm.ng <- TermDocumentMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
    dtm.ng <- DocumentTermMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
    
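    If you want one table covering several gram sizes, you can convert each TermDocumentMatrix with as.matrix() and stack them with rbind(), since the document columns are the same across tokenizers. A toy sketch with plain matrices standing in for the converted tdms (the term counts here are made up for illustration):

```r
# Rows are terms, columns are documents, which is the shape as.matrix()
# returns for a TermDocumentMatrix. rbind() stacks the gram sizes.
m4 <- matrix(c(1, 0), nrow = 1,
             dimnames = list("abstract strategy board game", c("doc1", "doc2")))
m3 <- matrix(c(0, 2), nrow = 1,
             dimnames = list("accenture executive simple", c("doc1", "doc2")))
merged <- rbind(m4, m3)
merged["accenture executive simple", "doc2"]
# 2
```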

    If you still have problems please just provide a reproducible example and I'll follow up.