I have a lot of PDFs in two-column format, and I am using the pdftools package in R. Is there a way to read each PDF column by column without cropping each one individually?
Each PDF consists of selectable text, and the pdf_text function has no problem reading it. The only issue is that it reads the first line of the first column and then proceeds to the next column, instead of moving down the first column.
Thank you very much in advance for your help.
I had the same problem. What I did was find the most frequent space positions on each page of my PDFs and store them in a vector. Then I sliced each line of the page at those positions.
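A minimal sketch of that idea on a hand-made page, with no PDF involved. The sample lines and the names `lines`, `padded`, `gutter`, `left` and `right` are made up for illustration, and this sketch counts spaces per character position with `which.max` instead of the table-based lookup used in the full script.

```r
# Hypothetical page: left column on the left, right column starting at char 13.
lines <- c(
  "apple pie   cherry tart",
  "tea         coffee",
  "rye bread   milk"
)

# Pad every line to the same width so character positions line up.
width  <- max(nchar(lines))
padded <- paste0(lines, strrep(" ", width - nchar(lines)))

# One row per line, one column per character position.
chars <- do.call(rbind, strsplit(padded, ""))

# Count, for each position, how many lines have a space there,
# and cut at the first position with the highest count.
space_count <- colSums(chars == " ")
gutter <- which.max(space_count)

# Read the whole left column first, then the whole right column.
left  <- trimws(substr(padded, 1, gutter))
right <- trimws(substr(padded, gutter + 1, width))
c(left, right)
```

The heuristic only works when the gap between the columns is the position where spaces line up most often. Pages whose lines all share a word break at the same offset, or that have heavy leading indentation, can fool it. The full script, which applies the same idea to every page returned by pdf_text, follows.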
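library(pdftools)
# Path to the PDF file to read.
src <- ""
# Strip leading and trailing whitespace.
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# Number of text columns on each page.
QTD_COLUMNS <- 2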
read_text <- function(text) {
  result <- character(0)
  # Positions of every " " in each line of the page (-1 means no match).
  lstops <- gregexpr(pattern = " ", text, fixed = TRUE)
  positions <- unlist(lstops)
  positions <- positions[positions > 0]
  # Keep the most frequent space positions, one per column boundary,
  # in left-to-right order.
  counts <- sort(table(positions), decreasing = TRUE)
  stops <- sort(as.integer(names(counts)[seq_len(QTD_COLUMNS - 1)]))
  # Slice each line at those positions (this can be improved).
  for (i in seq_len(QTD_COLUMNS)) {
    temp_result <- sapply(text, function(x) {
      start <- 1
      stop <- stops[i]
      if (i > 1)
        start <- stops[i - 1] + 1
      if (i == QTD_COLUMNS)  # last column, read until end
        stop <- nchar(x) + 1
      substr(x, start = start, stop = stop)
    }, USE.NAMES = FALSE)
    temp_result <- trim(temp_result)
    result <- append(result, temp_result)
  }
  result
}
txt <- pdf_text(src)
result <- character(0)
for (i in seq_along(txt)) {
  page <- txt[i]
  t1 <- unlist(strsplit(page, "\n"))
  # Pad every line to the same width so space positions line up across lines.
  maxSize <- max(nchar(t1))
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result
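A different route, not part of the script above, is to work from word coordinates instead of space positions. pdftools also provides pdf_data, which returns one data frame per page with an x, y and text value for each word. The sketch below is only an illustration under stated assumptions: `read_two_columns` and its `split_x` parameter are hypothetical names, the sample data frame is made up rather than read from a PDF, and words are assumed to belong to the same line only when their y values match exactly.

```r
# Hypothetical helper: return the lines of one page, left column first.
# `words` needs the columns x, y and text, as in one element of the list
# returned by pdftools::pdf_data(). `split_x` is the x coordinate that
# separates the two columns (an assumption the caller has to supply).
read_two_columns <- function(words, split_x) {
  column_lines <- function(col) {
    if (nrow(col) == 0) return(character(0))
    # Sort top to bottom, then left to right within a line.
    col <- col[order(col$y, col$x), ]
    # Join the words that share a y coordinate into one line.
    as.vector(tapply(col$text, col$y, paste, collapse = " "))
  }
  left  <- words[words$x <  split_x, ]
  right <- words[words$x >= split_x, ]
  c(column_lines(left), column_lines(right))
}

# Made-up word positions standing in for one page of pdf_data() output.
words <- data.frame(
  x    = c(10, 40, 300, 330, 10, 300),
  y    = c(100, 100, 100, 100, 112, 112),
  text = c("Left", "first", "Right", "first", "left2", "right2"),
  stringsAsFactors = FALSE
)
two_col <- read_two_columns(words, split_x = 200)
two_col
```

On real pages the y values of words on the same line may differ slightly, so grouping would probably need a tolerance, and the split point has to be chosen per layout.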