
How to implement a backup tokenizer switch in RWeka?


I am using the R tm and RWeka packages to do some text mining. Building a term-document matrix on single words is not enough for my purposes, so I have to extract n-grams. I used @Ben's function to extract trigrams:

    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
    tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))

The output has an apparent error, see below: it picks up 4-, 3-, and 2-word phrases. Ideally, it should have picked up ONLY the 4-word noun phrase and dropped the rest (the 3- and 2-word phrases). How do I force this, the way Python's NLTK has a backup tokenizer option?

abstract strategy -> incorrect
abstract strategy board -> incorrect
abstract strategy board game -> this should be the correct output

accenture executive
accenture executive simple
accenture executive simple comment

Many thanks.


Solution

  • I think you were very close with the attempt that you made, except that you have to understand that what you told Weka to do was to capture both 2-gram and 3-gram tokens; that is simply how you specified Weka_control (min = 2, max = 3).

    Instead, I'd recommend using a different token size in each tokenizer, then selecting or merging the results according to your preference or decision rule.

    I think it would be worth checking out this great tutorial on n-gram wordclouds.

    A solid code snippet for n-gram text mining is:

    # QuadgramTokenizer ####
    QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
    

    for 4-grams,

    # TrigramTokenizer ####
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    

    for 3-grams, and of course

    # BigramTokenizer ####
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    

    for 2-grams.

    You might be able to avoid your earlier problem by running the different gram sizes separately like this, instead of setting Weka_control to a range.
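    Once you have the separate token sets, one simple decision rule is to drop every shorter n-gram that is merely a substring of a longer n-gram you already kept, which approximates the "backup tokenizer" behaviour you describe. A minimal base-R sketch (the helper name drop_contained is hypothetical, not part of RWeka or tm):

```r
# Hypothetical helper: keep a shorter n-gram only if it does NOT occur
# inside any of the longer n-grams already kept. Longer phrases win;
# shorter phrases survive only when they carry new information.
drop_contained <- function(short_grams, long_grams) {
  contained <- vapply(
    short_grams,
    function(g) any(grepl(g, long_grams, fixed = TRUE)),
    logical(1)
  )
  short_grams[!contained]
}

quadgrams <- c("abstract strategy board game")
trigrams  <- c("abstract strategy board", "accenture executive simple")
drop_contained(trigrams, quadgrams)
# "abstract strategy board" is dropped; "accenture executive simple" is kept
```

    You would apply this once per step down (4-grams over 3-grams, then the survivors over 2-grams).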

    You can apply the tokenizer like this:

    tdm.ng <- TermDocumentMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
    dtm.ng <- DocumentTermMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
    
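    If you want one table covering several gram sizes, you can convert each TermDocumentMatrix with as.matrix() and stack them with rbind(), since the document columns are the same across tokenizers. A toy sketch with plain matrices standing in for the converted tdms (the term counts here are made up for illustration):

```r
# Rows are terms, columns are documents, which is the shape as.matrix()
# returns for a TermDocumentMatrix. rbind() stacks the gram sizes.
m4 <- matrix(c(1, 0), nrow = 1,
             dimnames = list("abstract strategy board game", c("doc1", "doc2")))
m3 <- matrix(c(0, 2), nrow = 1,
             dimnames = list("accenture executive simple", c("doc1", "doc2")))
merged <- rbind(m4, m3)
merged["accenture executive simple", "doc2"]
# 2
```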

    If you still have problems please just provide a reproducible example and I'll follow up.