I've been struggling with the RWeka package, specifically with using the NGramTokenizer function to make bigrams. Searching online, I've found one or two other users with the same issue, but no solution that works for me.
Below is an example: "2-gram and 3-gram instead of 1-gram using RWeka"
Running:
    library(RWeka)
    library(tm)
    as.matrix(TermDocumentMatrix(
      Corpus(VectorSource(c(txt1 = "This is my house",
                            txt2 = "My house is green"))),
      list(tokenize = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)),
           tolower = TRUE)))
I get:
           Docs
    Terms   txt1 txt2
      house    1    1
      this     1    0
      green    0    1
I've also tried it on a volatile corpus with the tokenizer function split out, as well as the way I learnt in a DataCamp course, but I get the error below instead.
    Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, :
      java.lang.NullPointerException
    Called from: .jcheck()
There were other workarounds I found online that ran without errors, but they still produced unigrams like above.
I'm running Java 1.8 and R 3.4.3, both 64-bit, on 64-bit Windows.
I tried installing older versions of RWeka, but installing an old version of tm threw errors, so I couldn't get that to work (I used the versions referenced by LukeA in the SO thread linked at the start of this question).
You need to use a VCorpus instead of a Corpus in order to use the NGramTokenizer.
So if you change your code to:
    as.matrix(TermDocumentMatrix(
      VCorpus(VectorSource(c(txt1 = "This is my house",
                             txt2 = "My house is green"))),
      list(tokenize = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)),
           tolower = TRUE)))
It will return:
              Docs
    Terms      1 2
      house is 0 1
      is green 0 1
      is my    1 0
      my house 1 1
      this is  1 0
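If it helps, here is the same fix with the tokenizer split out into its own function, plus a quick check of which corpus class you actually get. This is only a sketch: bigram_tokenizer, docs, simple_corp, v_corp, tdm and bigram_counts are names I made up for illustration. My understanding (treat it as an assumption, I have not checked every tm release) is that from tm 0.7 onwards Corpus() builds a SimpleCorpus from a VectorSource, and a custom tokenize function is not applied to that class, which would explain the unigrams.

```r
library(tm)
library(RWeka)

docs <- c(txt1 = "This is my house", txt2 = "My house is green")

# Build the corpus both ways and compare the classes.
# Assumption: on tm >= 0.7 the first one is a SimpleCorpus.
simple_corp <- Corpus(VectorSource(docs))
v_corp <- VCorpus(VectorSource(docs))
print(class(simple_corp))
print(class(v_corp))

# Hypothetical helper name: wraps NGramTokenizer so it only emits 2-grams.
bigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Pass the tokenizer through the control list, using the VCorpus.
tdm <- TermDocumentMatrix(
  v_corp,
  control = list(tokenize = bigram_tokenizer, tolower = TRUE)
)

bigram_counts <- as.matrix(tdm)
print(bigram_counts)
```

If class(simple_corp) does show SimpleCorpus on your install, that is probably why your original call fell back to single words.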