I am using the R tm and RWeka packages to do some text mining. Building a term-document matrix on single words is not enough for my purposes, so I have to extract n-grams. I used @Ben's function

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
to extract trigrams. The output has an apparent error, see below. It picks up 4-, 3- and 2-word phrases. Ideally, it should have picked up ONLY the 4-word noun phrase and dropped the (3- and 2-word) rest. How do I force this solution, similar to the way Python NLTK has a backup tokenizer option?
abstract strategy                  -> this is incorrect
abstract strategy board            -> incorrect
abstract strategy board game       -> this should be the correct output
accenture executive
accenture executive simple
accenture executive simple comment
Many thanks.
I think you were very close with the attempt you made, except that what you were telling Weka to do was to capture both 2-gram and 3-gram tokens; that's just how Weka_control was specified.
Instead, I'd recommend using the different token sizes in separate tokenizers and then selecting or merging the results according to your preference or decision rule; one possible rule is sketched below.
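For example, one simple decision rule is to keep only the longest phrases and drop any term that is a sub-phrase of a longer one, which is what your example output suggests you want. A minimal sketch of that idea (the helper name drop_subphrases and the sample terms are mine, not from your code):

# Keep only terms that do not occur inside a longer term
drop_subphrases <- function(terms) {
  keep <- vapply(seq_along(terms), function(i) {
    !any(grepl(terms[i], terms[-i], fixed = TRUE))
  }, logical(1))
  terms[keep]
}

drop_subphrases(c("abstract strategy",
                  "abstract strategy board",
                  "abstract strategy board game"))
# [1] "abstract strategy board game"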
I think it would be worth checking out this great tutorial on n-gram wordclouds.
A solid code snippet for n-gram text mining is:
# QuadgramTokenizer ####
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
for 4-grams,
# TrigramTokenizer ####
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
for 3-grams, and of course
# BigramTokenizer ####
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
for 2-grams.
You might be able to avoid your earlier problem by running the different gram sizes separately like this, instead of setting Weka_control to a range.
You can apply the tokenizer like this:
tdm.ng <- TermDocumentMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
dtm.ng <- DocumentTermMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
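If you build one matrix per n-gram size with the tokenizers above, you can then stack and filter them however you like. A rough sketch, assuming the same ds5.1g corpus and reusing the drop_subphrases helper from above (the tdm.2g/tdm.3g/tdm.4g names are just placeholders):

tdm.2g <- TermDocumentMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
tdm.3g <- TermDocumentMatrix(ds5.1g, control = list(tokenize = TrigramTokenizer))
tdm.4g <- TermDocumentMatrix(ds5.1g, control = list(tokenize = QuadgramTokenizer))

# Stack the term rows of the three matrices into one ordinary matrix of counts
all.ng <- rbind(as.matrix(tdm.2g), as.matrix(tdm.3g), as.matrix(tdm.4g))

# Optionally keep only terms that are not sub-phrases of a longer term
all.ng <- all.ng[rownames(all.ng) %in% drop_subphrases(rownames(all.ng)), , drop = FALSE]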
If you still have problems, please provide a reproducible example and I'll follow up.