r tokenize text-mining tidytext

How to tokenize a PDF file by n-grams in R


I want to tokenize a PDF document into n-grams in R. I tried to follow the instructions at https://www.tidytextmining.com/ngrams.html, but I get stuck at the unnest_tokens() function.

library(tm)
library(dplyr)
library(tidytext)
library(tidyverse)


filedoc <- "Document2019.pdf"
cname <- file.path(filedoc)
docs <- Corpus(URISource(cname), readerControl=list(reader=readPDF, language = "en")) 

docs_bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

I keep getting this error message: Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "c('VCorpus', 'Corpus')"

Is there anything I need to do before running the unnest_tokens function? Thank you.


Solution

  • I went with @phiver's suggestion to use the tidy() function, and I am reposting the answer here so that this thread can be marked as answered.

    "use the tidy function before unnest_tokens. Tidytext uses the tidy function to transform from tm objects to tibbles."

    Thanks!
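
For anyone landing here later, this is a minimal sketch of what that suggestion might look like. The corpus below is a stand-in built from a short in-memory string (my assumption, so the snippet runs without Document2019.pdf); with the real file, docs would come from the Corpus(URISource(...)) call in the question. The name docs_tidy is made up for illustration.

```r
library(tm)
library(dplyr)
library(tidytext)

# Stand-in corpus (assumption for illustration only). With the real PDF,
# keep the Corpus(URISource(cname), readerControl = ...) call from the question.
docs <- VCorpus(VectorSource("the quick brown fox jumps over the lazy dog"))

# tidy() turns the tm corpus into a tibble: one row per document,
# with the document text in a column named `text`.
docs_tidy <- tidy(docs)

# unnest_tokens() has a data frame to work on now, so the bigram split goes through.
docs_bigrams <- docs_tidy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Count the bigrams, most frequent first.
docs_bigrams %>%
  count(bigram, sort = TRUE)
```

The original error comes from passing the VCorpus straight to unnest_tokens(), which has no method for that class. Once tidy() has produced a tibble with a text column, the rest of the code from the question should work unchanged.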