How to tokenize my dataset in R using the tidytext library?

I have been trying to follow Text Mining with R by Julia Silge; however, I cannot tokenize my dataset with the unnest_tokens function.

Here are the packages I have loaded:

# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)

Here is the dataset I tried to use. It is available online, so the results should be reproducible:

bible <- readLines('http://bereanbible.com/bsb.txt')

And here is where everything falls apart.

Input:

bible <- bible %>%
  unnest_tokens(word, text)

Output:

Error in tbl[[input]] : subscript out of bounds

From what I have read about this error in RStudio, the issue is that the dataset needs to be a matrix, so I tried converting the dataset into a matrix, but I received the same error message.

Input:

bible <- readLines('http://bereanbible.com/bsb.txt')

bible <- as.matrix(bible, nrow = 31105, ncol = 2)

bible <- bible %>%
  unnest_tokens(word, text)

Output:

Error in tbl[[input]] : subscript out of bounds

Any recommendations for next steps I could take, or for good text mining resources I could use as I continue to dive into this, would be very much appreciated.

Solution

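The problem is that readLines() creates a character vector, not the data frame that unnest_tokens() expects, so you need to convert it first. A vector (or a matrix built from it) has no column named text, which is why tbl[[input]] fails with "subscript out of bounds". It is also helpful to separate the verse into its own column: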

library(tidytext)
library(tidyverse)

bible_orig <- readLines('http://bereanbible.com/bsb.txt')

# Drop the first three lines (copyright notice etc.)
bible_orig <- bible_orig[4:length(bible_orig)]

# Convert to df
bible <- enframe(bible_orig)

# Separate verse from text
bible <- bible %>%
  separate(value, into = c("verse", "text"), sep = "\t")

tidy_bible <- bible %>% 
  unnest_tokens(word, text)

tidy_bible
#> # A tibble: 730,130 x 3
#>     name verse       word     
#>    <int> <chr>       <chr>    
#>  1     1 Genesis 1:1 in       
#>  2     1 Genesis 1:1 the      
#>  3     1 Genesis 1:1 beginning
#>  4     1 Genesis 1:1 god      
#>  5     1 Genesis 1:1 created  
#>  6     1 Genesis 1:1 the      
#>  7     1 Genesis 1:1 heavens  
#>  8     1 Genesis 1:1 and      
#>  9     1 Genesis 1:1 the      
#> 10     1 Genesis 1:1 earth    
#> # … with 730,120 more rows

Created on 2020-07-14 by the reprex package (v0.3.0)
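
To see why both the vector and the matrix attempts fail without downloading the file, here is a minimal offline sketch of the same pipeline. The two input lines are stand-ins in the "verse<TAB>text" layout: the first reuses the Genesis 1:1 words shown in the output above, the second is invented for illustration. The names bible_lines, bible_small and tidy_small are hypothetical, and tibble() is used here in place of enframe() only to skip the extra name column.

```r
library(tibble)
library(tidyr)
library(tidytext)

# Stand-in for what readLines() returns: one string per line, "verse<TAB>text".
# The second line is made up for illustration, not taken from the file.
bible_lines <- c(
  "Genesis 1:1\tIn the beginning God created the heavens and the earth.",
  "Sample 1:1\tThis second line is invented."
)

# Neither a character vector nor a matrix is a data frame,
# so there is no column called "text" for unnest_tokens() to look up.
is.data.frame(bible_lines)             # FALSE
is.data.frame(as.matrix(bible_lines))  # FALSE

# Wrap the vector in a one-column tibble, then split each line at the tab.
bible_small <- tibble(value = bible_lines)
bible_small <- separate(bible_small, value, into = c("verse", "text"), sep = "\t")

# One row per word; by default the words are lowercased and the text column is dropped.
tidy_small <- unnest_tokens(bible_small, word, text)
tidy_small
```

Checking is.data.frame() on your object before calling unnest_tokens() is a quick way to catch this kind of error early.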