I am using tidyverse, tidytext, and pdftools to parse words from a directory of 70 PDF files. The tools are working, but the code below grabs all the pages of each file instead of the subset I want: I need to skip the first two pages and keep page 3 through the end of each PDF.
directory <- "Student_Artifacts/"
pdfs <- paste(directory, "/", list.files(directory, pattern = "*.pdf"), sep = "")
pdf_names <- list.files(directory, pattern = "*.pdf")
pdfs_text <- map(pdfs, (pdf_text))
my_data <- data_frame(document = pdf_names, text = pdfs_text)
I figured out that by adding [3:12] like this I can grab the 3rd through 12th documents:
pdfs_text <- map(pdfs, (pdf_text))[3:12]
This is not what I want, though. How do I use the [3:12] specification to pull the pages I want from each PDF file?
First off, you could index out the 3rd-to-12th page from each PDF within the mapping of pdf_text, with just some very small changes:
pdfs_text <- map(pdfs, ~ pdf_text(.x)[3:12])
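If you'd rather not hard-code the last page, a minimal variation on the same idea (just a sketch, assuming you always want page 3 through whatever the final page is) is to drop the first two pages with negative indexing:
pdfs_text <- map(pdfs, ~ pdf_text(.x)[-(1:2)])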
Hard-coding [3:12] assumes that all 70 of your PDFs are exactly 12 pages long, though. Reading 70 files one at a time might also be slow, especially if some of them are really big. Try something like this (I used R's PDF documentation to demo with):
library(furrr)
#> Loading required package: future
library(pdftools)
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
plan(multiprocess)
directory <- file.path(R.home("doc"), "manual")
pdf_names <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
# Drop the full reference manual since it's so big
pdf_names %<>% str_subset("fullrefman.pdf", negate = TRUE)
pdfs_text <- future_map(pdf_names, pdf_text, .progress = TRUE)
#> Progress: ----------------------------------------------------------------------------------- 100%
my_data <- tibble(
  document = basename(pdf_names),
  text = map_chr(pdfs_text, ~ {
    str_c("Page ", seq_along(.x), ": ", str_squish(.x)) %>%
      tail(-2) %>%
      str_c(collapse = "; ")
  })
)
my_data
#> # A tibble: 6 x 2
#> document text
#> <chr> <chr>
#> 1 R-admin.pdf "Page 3: i Table of Contents 1 Obtaining R . . . . . . . . .~
#> 2 R-data.pdf "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 3 R-exts.pdf "Page 3: i Table of Contents Acknowledgements . . . . . . . ~
#> 4 R-intro.pdf "Page 3: i Table of Contents Preface . . . . . . . . . . . .~
#> 5 R-ints.pdf "Page 3: i Table of Contents 1 R Internal Structures . . . .~
#> 6 R-lang.pdf "Page 3: i Table of Contents 1 Introduction . . . . . . . . ~
Created on 2019-10-19 by the reprex package (v0.3.0)
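One caveat if you're on a newer version of the future package than this reprex used: plan(multiprocess) has since been deprecated there, and the usual drop-in replacement (worth checking against your installed version) is:
plan(multisession)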
The main points:
- tail(-2) is doing the work you're most concerned with: dropping the first two pages. Usually you use tail() to grab the last n pages, but it's also ideal for grabbing all but the first n pages; just use the negative.
- plan() and future_map() are parallelizing the PDF reading, with each of your virtual cores reading one PDF at a time. Also, progress bar!
- str_c() and map_chr() collapse each document's pages into a single string in text, since it appears that you ultimately want the full text of each document's pages in one cell in your final table. I'm inserting "; Page [n]: " in between each page's text so that data isn't lost, and I'm also removing extra whitespace throughout all the text with str_squish(), since there's usually tons.
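To make that last step concrete, here's a small illustrative sketch (toy page strings standing in for pdf_text() output, with the tidyverse loaded as above) of what the per-document pipeline does:
# Four fake "pages" with messy whitespace, like raw pdf_text() output
pages <- c("Cover\n  page", "Table of\n  contents", "Real   content,\npage one", "More   content,\npage two")
str_c("Page ", seq_along(pages), ": ", str_squish(pages)) %>%
  tail(-2) %>%                 # drop the first two pages
  str_c(collapse = "; ")       # one string per document
#> [1] "Page 3: Real content, page one; Page 4: More content, page two"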