r, tm, rweka

n-grams in R error: invalid 'times' argument


I'm trying to follow this example but hit an error.

> library("RWeka")
> library("tm")
Loading required package: NLP
> data("crude")
> BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
> tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
Error in rep(seq_along(x), sapply(tflist, length)) : 
  invalid 'times' argument
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  scheduled core 1 encountered error in user code, all values of the job will be affected

Any ideas?


Solution

  • Consider switching to a more modern package. A few options:

    1. Use text2vec instead of tm; its vignettes include examples. (I'm the author.)
    2. quanteda is also worth a look.
    3. If you prefer to stay with tm, use the tokenizers package in place of the RWeka n-gram tokenizer.
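
As a rough illustration of option 3, the sketch below keeps the tm workflow from the question and swaps the RWeka tokenizer for tokenizers::tokenize_ngrams(). The as.character()/paste() coercion is an assumption about what tm passes to the tokenizer (a document object or plain text), so treat this as a starting point rather than the definitive fix.

```r
library(tm)
library(tokenizers)

data("crude")

# Bigram tokenizer built on tokenizers::tokenize_ngrams() instead of RWeka.
# Assumption: tm may pass either a document object or a character vector,
# so coerce to a single string before tokenizing.
BigramTokenizer <- function(x) {
  txt <- paste(as.character(x), collapse = " ")
  # n = 2 and n_min = 2 restrict the output to bigrams only.
  unlist(tokenize_ngrams(txt, n = 2L, n_min = 2L), use.names = FALSE)
}

tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[1:5, 1:5])
```

Note that tokenize_ngrams() lowercases its input by default, so the resulting terms may differ in case from what the RWeka tokenizer would produce.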