Tags: r, pdf, tidyverse, tidytext, pdftools

Trying to extract a subset of pages from each PDF in a directory with 70 PDF files


I am using tidyverse, tidytext, and pdftools to parse the words in a directory of 70 PDF files. The tools work, but the code below grabs all the pages of each file instead of the subset I want: I need to skip the first two pages and select page 3 through the end of each PDF.

library(tidyverse)
library(tidytext)
library(pdftools)

directory <- "Student_Artifacts/"
pdfs <- paste(directory, "/", list.files(directory, pattern = "*.pdf"), sep = "")
pdf_names <- list.files(directory, pattern = "*.pdf")
pdfs_text <- map(pdfs, pdf_text)
my_data <- data_frame(document = pdf_names, text = pdfs_text)

I figured out that by appending [3:12] like this, I can grab the 3rd through 12th documents:

pdfs_text <- map(pdfs, pdf_text)[3:12]

This is not what I want, though. How do I use the [3:12] specification to pull the pages I want from each PDF file?


Solution

  • First off, you could index out the 3rd through 12th pages from each PDF within the mapping of pdf_text, with just some very small changes:

    pdfs_text <- map(pdfs, ~ pdf_text(.x)[3:12])
    

    But this assumes that all 70 of your PDFs are exactly 12 pages long. If what you actually want is "page 3 to the end, however long the file is", negative indexing drops the first two pages without any page-count assumption; here's a minimal sketch of that idea (pdfs being your vector of file paths):
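
    # drop the first two pages of every PDF, whatever its length
    pdfs_text <- map(pdfs, ~ pdf_text(.x)[-(1:2)])

    Even so, reading 70 PDFs one at a time might be slow, especially if some of them are really big. Try something like this (I used R's PDF documentation to demo with):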

    library(furrr)
    #> Loading required package: future
    library(pdftools)
    library(tidyverse)
    library(magrittr)
    #> 
    #> Attaching package: 'magrittr'
    #> The following object is masked from 'package:purrr':
    #> 
    #>     set_names
    #> The following object is masked from 'package:tidyr':
    #> 
    #>     extract
    
    plan(multiprocess)
    
    directory <- file.path(R.home("doc"), "manual")
    pdf_names <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
    # Drop the full reference manual since it's so big
    pdf_names %<>% str_subset("fullrefman.pdf", negate = TRUE)
    pdfs_text <- future_map(pdf_names, pdf_text, .progress = TRUE)
    #> Progress: ----------------------------------------------------------------------------------- 100%
    
    my_data   <- tibble(
      document = basename(pdf_names), 
      text     = map_chr(pdfs_text, ~ {
        str_c("Page ", seq_along(.x), ": ", str_squish(.x)) %>% 
          tail(-2) %>% 
          str_c(collapse = "; ")
      })
    )
    
    my_data
    #> # A tibble: 6 x 2
    #>   document    text                                                         
    #>   <chr>       <chr>                                                        
    #> 1 R-admin.pdf "Page 3: i Table of Contents 1 Obtaining R . . . . . . . . .~
    #> 2 R-data.pdf  "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
    #> 3 R-exts.pdf  "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
    #> 4 R-intro.pdf "Page 3: i Table of Contents Preface . . . . . . . . . . . .~
    #> 5 R-ints.pdf  "Page 3: i Table of Contents 1 R Internal Structures . . . .~
    #> 6 R-lang.pdf  "Page 3: i Table of Contents 1 Introduction . . . . . . . . ~
    

    Created on 2019-10-19 by the reprex package (v0.3.0)

    The main points:

    1. The tail(-2) is doing the work you're most concerned with: dropping the first two pages. Usually you use tail() to grab the last n elements of a vector, but it's equally handy for grabbing everything except the first n elements - just pass a negative n (see the toy example after this list).
    2. The plan() and future_map() are parallelizing the PDF-reading, with each of your virtual cores reading one PDF at a time. Also, progress bar!
    3. I'm doing some fancy string concatenation when constructing text, since it appears you ultimately want the full text of each document's pages in a single cell of your final table. I'm prefixing each page with "Page [n]: " and joining the pages with "; " so that information isn't lost, and I'm also squishing out extra whitespace throughout the text, since there's usually a ton of it. The sketch below shows that pipeline on its own.
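
    Here is the labelling-and-dropping pipeline in isolation, as a toy sketch (the pages vector is made up for illustration and stands in for one element of pdfs_text; str_c(), str_squish(), and %>% come from the libraries loaded above):

    # toy stand-in for the page vector pdf_text() returns for one file
    pages <- c("cover", "table   of contents", "Body   text one", "Body   text two")
    str_c("Page ", seq_along(pages), ": ", str_squish(pages)) %>% 
      tail(-2) %>% 
      str_c(collapse = "; ")
    #> [1] "Page 3: Body text one; Page 4: Body text two"

    And since you mentioned tidytext in the question: once my_data is built, splitting the text column into one word per row is the standard unnest_tokens() pattern (a sketch, assuming the column names used above):

    library(tidytext)
    words <- my_data %>% unnest_tokens(word, text)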