
How to select pages from many PDFs that have a common character string located in the same area using pdf_data from pdftools?


Let’s consider this PDF, imported in R as follows:

library(pdftools)
library(tidyverse)
mylink <- "https://www.probioqual.com/12_PDF/02_EEQ/Modele_Rapport_EEQ.pdf"
mypdf <- pdf_data(mylink)

For this document, the pdf_data function returns a list of 35 tibbles (one per page); each tibble has one row per word and 6 columns, among which the x and y coordinates.

Let’s now consider several PDFs stored in the same folder, imported using:

mypdfs_list <- list.files(pattern = '\\.pdf$')  # note: 'pattern' is a regex, not a glob
allpdfs <- lapply(mypdfs_list, pdf_data)

which gives a nested list with one element per PDF, each itself a list of per-page tibbles (visible in the RStudio environment pane).

Among allpdfs, I would like to select only the pages that contain the character string "Limites acceptables" in the top-right box, as highlighted in yellow on page 5 of the PDF above.

NB: matching this specific string is the way I found to identify the pages that contain the tables of interest. The first text pages of each PDF (whose number varies from one PDF to another) do not interest me, so I want to discard them; e.g., in the PDF above I want to discard the first 4 pages of text (but in another PDF it might be the first 3 or the first 5, for example).

Using pdftools::pdf_data, the "Limites acceptables" string is always located inside the area of coordinates x>360 & x<580 & y>26 & y<35.
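To make the area test concrete, here is a minimal sketch of a predicate for a single page tibble. `page_has_marker` is a hypothetical helper name (not part of pdftools), and the tokens checked ("Li", "mi", "Limites") are assumptions about how pdf_data() happens to tokenize "Limites acceptables":

```r
# Sketch: does one pdf_data() page tibble contain a marker token inside the
# fixed top-right area? The token list is an assumption about how the
# string is split into words by pdf_data().
page_has_marker <- function(page) {
  in_area <- page$x > 360 & page$x < 580 & page$y > 26 & page$y < 35
  any(in_area & page$text %in% c("Li", "mi", "Limites"))
}

# Synthetic example page (real pages come from pdf_data()):
fake_page <- data.frame(
  x    = c(100, 400, 410),
  y    = c(50,  30,  30),
  text = c("Rapport", "Li", "mi")
)
page_has_marker(fake_page)  # TRUE: "Li" and "mi" fall inside the area
```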

Question: is it possible, using a function (map, lapply, or another, possibly combined with e.g. filter), to select only these pages (thus discarding the leading text pages) from all the lists of imported PDFs?

Of course open to any other approach!

Thanks


Solution

  • A slightly complicated solution, but it works:

    # import all PDFs from the folder ('pattern' is a regex, not a glob)
    mypdfs_list <- list.files(pattern = '\\.pdf$')
    allpdfs <- lapply(mypdfs_list, pdf_data)

    # map over the nested lists: flag 'page_ok' = 1 on rows where the fixed
    # area contains the marker tokens (pdf_data tokenizes the string here
    # as "Li" and "mi")
    allpdfs <- map(allpdfs, ~ .x %>%
                     map(~ mutate(., page_ok = case_when(
                       x > 360 & x < 580 & y > 26 & y < 35 & text %in% c("Li", "mi") ~ 1,
                       TRUE ~ 0
                     ))))

    # propagate the flag: set 'page_ok' to 1 on every row of a page if any
    # row of that page was flagged
    allpdfs <- map(allpdfs, ~ .x %>%
                     map(~ mutate(., page_ok = if_else(any(page_ok == 1), 1, 0))))

    # keep only the tibbles (i.e., pages) whose 'page_ok' column is not all zeros
    allpdfs <- map(allpdfs, ~ .x %>%
                     keep(~ sum(.x$page_ok) != 0))
    

    The leading text pages are thus deleted. Comparing the RStudio environment panes before and after: the 1st PDF now has 26 pages instead of 29, the 2nd 35 instead of 38, the 3rd 26 instead of 28...

    I would have liked to be able to combine these 3 steps into one. Would there be a simpler solution?
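One way to collapse the three steps: since the 'page_ok' column only exists to drive the final filter, the whole pipeline can be a single keep() per PDF. A sketch under the same area and token assumptions as above (`keep_marked_pages` is a name chosen here, not from any package):

```r
library(purrr)

# One-step sketch: keep only pages whose word tibble has a marker token
# inside the fixed area; no intermediate 'page_ok' column is needed.
# Coordinates and tokens ("Li", "mi") are taken from the answer above.
keep_marked_pages <- function(pdf_pages) {
  keep(pdf_pages, ~ any(.x$x > 360 & .x$x < 580 & .x$y > 26 & .x$y < 35 &
                          .x$text %in% c("Li", "mi")))
}

# applied to the nested list from the answer:
# allpdfs <- map(allpdfs, keep_marked_pages)
```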