r  nlp  text-mining  quanteda

Tokenizing a PDF for quantitative analysis


I ran into an issue using the unnest_tokens function on a data_frame. I am working with PDF files that I want to compare.

text_path <- "c:/.../text1.pdf"
text_raw <- pdf_text(text_path)
text1df <- data_frame(Zeile = 1:25, 
                      text_raw)

So far so good. But here is my problem:

  unnest_tokens(output = token, input = content) -> text1_long

Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `function`.
i It must be numeric or character.

I want to tokenize my PDF files so I can analyse the word frequencies and maybe compare multiple PDF files using word clouds.


Solution

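  • In the call as shown, input = content does not name a column of text1df (the column holding the text is text_raw), and no data frame is passed to unnest_tokens. Here is a simple piece of code that addresses both. I kept your German words so you can copy and paste everything.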

    library(pdftools)
    library(dplyr)
    library(stringr)
    library(tidytext)
    
    file_location <- "d:/.../my_doc.pdf"
    text_raw <- pdf_text(file_location)
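    # pdf_text() returns one string per page, so Zeile is really a page number;
    # seq_along() avoids hard-coding the page count (12 pages in my file)
    text1df <- tibble(Zeile = seq_along(text_raw), 
                      text_raw) 
    
    # one lowercase token per row; keep only tokens with at least one letter a-z
    text1df_long <- unnest_tokens(text1df, output = wort, input = text_raw) %>% 
      filter(str_detect(wort, "[a-z]"))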
    
    text1df_long
    # A tibble: 4,134 x 2
       Zeile wort       
       <int> <chr>      
     1     1 training   
     2     1 and        
     3     1 development
     4     1 policy     
     5     1 contents   
     6     1 policy     
     7     1 statement  
     8     1 scope      
     9     1 induction  
    10     1 training   
    # ... with 4,124 more rows
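
  • To get from the tokens to word frequencies, and to compare several PDFs, a minimal sketch could look like the one below. It is an illustration under assumptions, not tested on real files: the docs tibble, its doc and page columns, and the sample sentences are made up and stand in for what pdf_text() would return for each file.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for two PDFs. In practice, text_raw would hold the
# output of pdf_text() for each file, with one row per page.
docs <- tibble(
  doc      = c("text1", "text1", "text2"),
  page     = c(1L, 2L, 1L),
  text_raw = c("Training and development policy",
               "Policy statement and scope",
               "Induction training policy")
)

# Tokenize into one lowercase word per row, then count words per document.
word_counts <- docs %>%
  unnest_tokens(output = wort, input = text_raw) %>%
  count(doc, wort, sort = TRUE)

word_counts
```

    The resulting table has one row per document and word, with the frequency in column n, so it can be filtered per document or handed to a word cloud function of your choice.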