I have been trying to follow Text Mining with R by Julia Silge, however, I cannot tokenize my dataset with the unnest_tokens function.
Here are the packages I have loaded:
# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)
Here is the dataset I tried to use which is online, so the results should be reproducible:
bible <- readLines('http://bereanbible.com/bsb.txt')
And here is where everything falls apart.
Input:
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
From what I have read about this error, in Rstudio, the issue is that the dataset needs to be a matrix, so I tried transforming the dataset into a matrix table and I received the same error message.
Input:
bible <- readLines('http://bereanbible.com/bsb.txt')
bible <- as.matrix(bible, nrow = 31105, ncol = 2 )
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
Any recommendations for what next steps I could take or maybe some good Text mining sources I could use as I continue to dive into this would be very much appreciated.
The problem is that readLines()
creates a vector, not a dataframe, as expected by unnest_tokens()
, so you need to convert it. It is also helpful to separate the verse to it's own column:
library(tidytext)
library(tidyverse)
bible_orig <- readLines('http://bereanbible.com/bsb.txt')
# Get rid of the copyright etc.
bible_orig <- bible_orig[4:length(bible_orig)]
# Convert to df
bible <- enframe(bible_orig)
# Separate verse from text
bible <- bible %>%
separate(value, into = c("verse", "text"), sep = "\t")
tidy_bible <- bible %>%
unnest_tokens(word, text)
tidy_bible
#> # A tibble: 730,130 x 3
#> name verse word
#> <int> <chr> <chr>
#> 1 1 Genesis 1:1 in
#> 2 1 Genesis 1:1 the
#> 3 1 Genesis 1:1 beginning
#> 4 1 Genesis 1:1 god
#> 5 1 Genesis 1:1 created
#> 6 1 Genesis 1:1 the
#> 7 1 Genesis 1:1 heavens
#> 8 1 Genesis 1:1 and
#> 9 1 Genesis 1:1 the
#> 10 1 Genesis 1:1 earth
#> # … with 730,120 more rows
Created on 2020-07-14 by the reprex package (v0.3.0)