r, text, text-mining, tidytext

How to tokenize my dataset in R using the tidytext library?


I have been trying to follow Text Mining with R by Julia Silge and David Robinson, but I cannot tokenize my dataset with the unnest_tokens function.

Here are the packages I have loaded:

# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)

Here is the dataset I tried to use. It is available online, so the results should be reproducible:

bible <- readLines('http://bereanbible.com/bsb.txt')

And here is where everything falls apart.

Input:

bible <- bible %>%
  unnest_tokens(word, text)

Output:

Error in tbl[[input]] : subscript out of bounds

From what I have read about this error in RStudio, the issue is that the dataset needs to be a matrix, so I tried converting the dataset to a matrix, but I received the same error message.

Input:

bible <- readLines('http://bereanbible.com/bsb.txt')

bible <- as.matrix(bible, nrow = 31105, ncol = 2)

bible <- bible %>%
  unnest_tokens(word, text)

Output:

Error in tbl[[input]] : subscript out of bounds

Any recommendations for next steps, or for good text-mining resources I could use as I continue to dive into this, would be very much appreciated.


Solution

  • The problem is that readLines() creates a character vector, not a data frame as expected by unnest_tokens(), so there is no text column for it to tokenize and you need to convert the vector first. It is also helpful to separate the verse into its own column:

    library(tidytext)
    library(tidyverse)
    
    bible_orig <- readLines('http://bereanbible.com/bsb.txt')
    
    # Get rid of the copyright etc.
    bible_orig <- bible_orig[4:length(bible_orig)]
    
    # Convert to df
    bible <- enframe(bible_orig)
    
    # Separate verse from text
    bible <- bible %>% 
     separate(value, into = c("verse", "text"), sep = "\t")
    
    tidy_bible <- bible %>% 
      unnest_tokens(word, text)
    
    tidy_bible
    #> # A tibble: 730,130 x 3
    #>     name verse       word     
    #>    <int> <chr>       <chr>    
    #>  1     1 Genesis 1:1 in       
    #>  2     1 Genesis 1:1 the      
    #>  3     1 Genesis 1:1 beginning
    #>  4     1 Genesis 1:1 god      
    #>  5     1 Genesis 1:1 created  
    #>  6     1 Genesis 1:1 the      
    #>  7     1 Genesis 1:1 heavens  
    #>  8     1 Genesis 1:1 and      
    #>  9     1 Genesis 1:1 the      
    #> 10     1 Genesis 1:1 earth    
    #> # … with 730,120 more rows
    

    Created on 2020-07-14 by the reprex package (v0.3.0)
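
To see why the original call fails without downloading the whole file, here is a minimal sketch on two made-up lines in the same "verse<TAB>text" layout. The `lines` vector and the `line` column name are stand-ins chosen for illustration, not part of the real dataset, and the exact error text may differ between tidytext versions.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in for what readLines() returns: one string per line,
# with the verse reference and the verse text separated by a tab.
lines <- c(
  "Genesis 1:1\tIn the beginning God created the heavens and the earth.",
  "Genesis 1:2\tNow the earth was formless and void."
)

# A character vector is not a data frame and has no column called "text",
# so unnest_tokens() has nothing to look up.
is.data.frame(lines)

# Capture the error message instead of stopping the script.
err <- tryCatch(
  unnest_tokens(lines, word, text),
  error = function(e) conditionMessage(e)
)
print(err)

# Wrapping the vector in a tibble gives it a named column; after splitting
# on the tab there is a real "text" column for unnest_tokens() to use.
tidy <- tibble(line = lines) %>%
  separate(line, into = c("verse", "text"), sep = "\t") %>%
  unnest_tokens(word, text)

tidy
```

The same idea applies to the as.matrix() attempt in the question: a matrix built from the vector still has no column named text, so the lookup fails in the same way. By default unnest_tokens() also lowercases the words and drops punctuation, which is why "God" shows up as "god" in the output above.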