I want to tokenize a PDF document into n-grams in R.
I tried to follow the instructions
at https://www.tidytextmining.com/ngrams.html,
but I got stuck at the unnest_tokens()
function. Here is my code:
library(tm)
library(dplyr)
library(tidytext)
library(tidyverse)
filedoc <- "Document2019.pdf"
cname <- file.path(filedoc)
docs <- Corpus(URISource(cname), readerControl=list(reader=readPDF, language = "en"))
docs_bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
I keep getting this error message:
Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "c('VCorpus', 'Corpus')"
Is there anything I need to do before running the unnest_tokens function? Thank you.
I went with @phiver's suggestion of using the tidy() function, and I am reposting it here so that this thread can be marked as answered:
"use the tidy function before unnest_tokens. Tidytext uses the tidy function to transform from tm objects to tibbles."
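For anyone who lands here later, this is roughly what that looks like. It is only a minimal sketch: the one-sentence corpus built with VectorSource() is a stand-in I made up so the snippet runs without a PDF. With a real file, docs would be built with readPDF exactly as in the question, and the tidy() and unnest_tokens() steps should stay the same.

```r
library(tm)
library(dplyr)
library(tidytext)

# Stand-in corpus (an assumption for illustration only); in the original
# question, docs comes from Corpus(URISource(cname), readerControl = ...).
docs <- VCorpus(VectorSource("The quick brown fox jumps over the lazy dog"))

# tidy() converts the tm corpus into a tibble with one row per document,
# holding the document metadata plus a `text` column.
docs_tidy <- tidy(docs)

# unnest_tokens() now receives a data frame, so the bigram split works.
docs_bigrams <- docs_tidy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

print(docs_bigrams$bigram)
```

The key point is that unnest_tokens() has methods for data frames, not for Corpus objects, which is why the error in the question complains about class "c('VCorpus', 'Corpus')".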
Thanks!