RWeka NgramTokenizer


I've been struggling with the RWeka package, specifically with using the NGramTokenizer function to make bigrams. Searching online, I've found one or two other users with the same issue, but no solution that works for me.

Below is an example, taken from the thread "2-gram and 3-gram instead of 1-gram using RWeka".

So running:

library(RWeka) 
library(tm)

as.matrix(TermDocumentMatrix(Corpus(VectorSource(c(txt1 = "This is my house",
                                               txt2 = "My house is green"))),
                         list(tokenize = function(x) NGramTokenizer(x, 
                                                                    Weka_control(min=2, 
                                                                                 max=2)),
                              tolower = TRUE)))

I get:

       Docs
Terms   txt1 txt2
  house    1    1
  this     1    0
  green    0    1
Note that there are no bigrams, just unigrams (house, this, green).
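
For reference, the bigrams expected from these two sentences can be worked out without RWeka or Java. The following is only a minimal base-R sketch for cross-checking results; the helper name `expected_bigrams` is made up for illustration and is not part of tm or RWeka, and it splits on whitespace only, so it ignores punctuation.

```r
# Hypothetical helper (not part of tm or RWeka): lower-case the text,
# split it on whitespace and pair each word with the word that follows it.
expected_bigrams <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

docs <- c(txt1 = "This is my house", txt2 = "My house is green")
lapply(docs, expected_bigrams)
# txt1: "this is"  "is my"    "my house"
# txt2: "my house" "house is" "is green"
```

These five distinct bigrams are the terms a working bigram tokenizer should put in the term-document matrix.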

I've also tried it on a volatile corpus with the tokenizer function split out, and also the way I learnt it in a DataCamp course, but then I get the error below instead.

Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,  :
  java.lang.NullPointerException
Called from: .jcheck()

Other workarounds I found online ran without errors, but still produced unigrams as above.

I'm running Java 1.8 and R 3.4.3, both 64-bit, on 64-bit Windows.

I tried installing older versions of RWeka, but installing an older version of tm failed with errors, so I couldn't get that to work (I used the versions referenced by LukeA in the SO thread linked at the start of this question).


Solution

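  • You need to use a VCorpus instead of a Corpus in order to use the NGramTokenizer. In recent versions of tm (0.7 and later, as far as I know), Corpus() on a VectorSource returns a SimpleCorpus, and TermDocumentMatrix() does not accept custom tokenizer functions for a SimpleCorpus; it always uses its built-in tokenizer.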

    So if you change your code to:

    as.matrix(TermDocumentMatrix(VCorpus(VectorSource(c(txt1 = "This is my house",
                                                        txt2 = "My house is green"))),
                                 list(tokenize = function(x) NGramTokenizer(x, 
                                                                            Weka_control(min=2, 
                                                                                         max=2)),
                                      tolower = TRUE)))
    

    It will return:

              Docs
    Terms      1 2
      house is 0 1
      is green 0 1
      is my    1 0
      my house 1 1
      this is  1 0
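
If it is unclear which kind of corpus a piece of code is building, the class can be checked directly. The following is a rough sketch, assuming tm 0.7 or later is installed; the variable names and the helper `accepts_custom_tokenizer` are made up for illustration and are not part of tm.

```r
library(tm)

docs <- c(txt1 = "This is my house", txt2 = "My house is green")

from_corpus  <- Corpus(VectorSource(docs))   # may be a SimpleCorpus, depending on the tm version
from_vcorpus <- VCorpus(VectorSource(docs))  # a VCorpus

class(from_corpus)
class(from_vcorpus)

# Hypothetical helper (not part of tm): according to the tm documentation,
# TermDocumentMatrix() on a SimpleCorpus takes no custom functions, so a
# tokenizer passed via `tokenize` only has an effect on other corpus types.
accepts_custom_tokenizer <- function(corpus) !inherits(corpus, "SimpleCorpus")

accepts_custom_tokenizer(from_vcorpus)
```

If the first `class()` call reports a SimpleCorpus, that would explain why the tokenizer passed in the question had no effect.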