I ran into an issue using the unnest_tokens function on a data_frame. I am working with pdf files I want to compare.
text_path <- "c:/.../text1.pdf"
text_raw <- pdf_text("c:/.../text1.pdf")
text1df <- data_frame(Zeile = 1:25,
                      text_raw)
So far so good. But here is my problem:
unnest_tokens(output = token, input = content) -> text1_long
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `function`.
i It must be numeric or character.
I want to tokenize my pdf files so I can analyse the word frequencies and maybe compare multiple pdf files on wordclouds.
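Here is a simple working example. I kept your German names so you can copy and paste everything. The important change is in the unnest_tokens call: pass the data frame as the first argument and set input to the column that actually holds the text, text_raw. text1df has no column named content, which is most likely why the column lookup fails.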
library(pdftools)
library(dplyr)
library(stringr)
library(tidytext)
file_location <- "d:/.../my_doc.pdf"
text_raw <- pdf_text(file_location)
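# pdf_text() returns one string per page, so number the rows by page
# (12 pages in my document). tibble() replaces the deprecated data_frame().
text1df <- tibble(Zeile = seq_along(text_raw),
                  text_raw)
# Split into one word per row, then keep only tokens containing a letter
text1df_long <- unnest_tokens(text1df, output = wort, input = text_raw) %>%
  filter(str_detect(wort, "[a-z]"))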
text1df_long
# A tibble: 4,134 x 2
   Zeile wort
   <int> <chr>
 1     1 training
 2     1 and
 3     1 development
 4     1 policy
 5     1 contents
 6     1 policy
 7     1 statement
 8     1 scope
 9     1 induction
10     1 training
# ... with 4,124 more rows
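
Since the goal is word frequencies across several PDFs, here is a rough sketch of how the next step could look. It is not tested against your files: the two character vectors are made-up stand-ins for what pdf_text() would return (one string per page), and the names pages_a, pages_b, seiten, dokument and wort_freq are just placeholders I picked. It also assumes the text is English, because it uses the English stop_words list that ships with tidytext.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-ins for pdf_text() output: one string per page.
# With real files, each vector would come from pdf_text() instead.
pages_a <- c("Training and development policy",
             "Policy statement and scope of training")
pages_b <- c("Induction training for new staff",
             "Scope of the induction policy")

# One row per page, labelled with the document it came from
# (dokument is an assumed column name, not something from your code).
seiten <- bind_rows(
  tibble(dokument = "a", Zeile = seq_along(pages_a), text_raw = pages_a),
  tibble(dokument = "b", Zeile = seq_along(pages_b), text_raw = pages_b)
)

wort_freq <- seiten %>%
  # one word per row, lower-cased by default
  unnest_tokens(output = wort, input = text_raw) %>%
  # keep only tokens that contain at least one letter
  filter(str_detect(wort, "[a-z]")) %>%
  # drop common English stop words such as "and" or "of"
  anti_join(stop_words, by = c("wort" = "word")) %>%
  # count how often each word occurs in each document
  count(dokument, wort, sort = TRUE)

wort_freq
```

The result has one row per document and word with the count in n, so you can filter by dokument and hand wort and n to whatever word cloud function you prefer, for example wordcloud() from the wordcloud package.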