Hi, first of all, thanks for the help. I would like to know if there’s a way to extract specific data that is located in the same place on every page of an editable PDF file.
The file (modified to comply with privacy concerns) contains a series of payroll receipts, and every page has the same format and fields. I would like to extract only the SSN (No. IMSS) of each employee and put the values in a data frame. I have searched for how to do this, but I have only found cases where the data is not properly structured. Since all pages in this file have exactly the same layout, I would like to know if there's a less troublesome way.
Using pdftools and the steps below, I was able to isolate the data I wanted (located on line 9), but only from an individual page. I would like to know if it’s possible to write a command that works for all pages. Thank you.
> library(pdftools)
> test <- pdf_text("pruebas.pdf")
> orden <- strsplit(test,"\r\n")
> required <- c(unlist(strsplit(orden[[1]],"\r\n")))
> nss <- required[9]
> result <- as.data.frame(nss)
This is a text parsing task and there are several ways to do it. Perhaps the quickest way is to split each page's text at every "No. IMSS:", select the second fragment, split that at the line break, then take the first fragment. The code isn't pretty, but it works:
sapply(strsplit(sapply(strsplit(pdftools::pdf_text("pruebas.pdf"),
"No\\. IMSS: +"), `[`, 2), "\r"), `[`, 1)
#> [1] "12-34-56-7895-5" "12-34-56-7895-9" "12-34-56-7895-7" "12-34-56-7895-1"
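If the nested one-liner is hard to read, the same idea can probably be written with a regular expression instead. The following is only a sketch: `extract_nss` is a hypothetical helper name, and the two sample page strings are made up for illustration (the real receipt layout may differ), so treat it as a starting point rather than a drop-in solution. It assumes the value you want is everything after "No. IMSS:" up to the end of that line.

```r
# Hypothetical helper: return the text that follows "No. IMSS:" on each page.
# `pages` is a character vector with one element per page, which is the shape
# pdftools::pdf_text() returns.
extract_nss <- function(pages) {
  # Capture everything after the label up to the next line break.
  # Excluding both \r and \n handles Windows and Unix line endings.
  pattern <- "No\\. IMSS: +([^\r\n]+)"
  matches <- regmatches(pages, regexec(pattern, pages))
  # Each list element holds the full match and the captured group;
  # pages without the label yield an empty vector, reported here as NA.
  vapply(
    matches,
    function(m) if (length(m) >= 2) trimws(m[2]) else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

# Made-up stand-in for the PDF text (assumed layout, not the real file).
# With the real file you would use: pages <- pdftools::pdf_text("pruebas.pdf")
pages <- c(
  "Recibo de nomina\r\nNo. IMSS:   12-34-56-7895-5\r\nOtro campo",
  "Recibo de nomina\r\nNo. IMSS:   12-34-56-7895-9\r\nOtro campo"
)

# One row per page, as requested in the question.
result <- data.frame(nss = extract_nss(pages), stringsAsFactors = FALSE)
```

Because the helper returns one value per page, a page where the label is missing shows up as `NA` in the data frame instead of silently shifting the remaining rows.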