How to read PDFs line by line in R?


I was using the read_pdf() function from the pdftools package to read PDF files line by line. Suddenly, without any change to the script, its arguments, or any line of it, the function started reading each whole page at once instead of separating the elements by line. How do I get it back to line-by-line separation? That is the only way I can use text mining to build the database I need.


Solution

  • With the following code, you can extract the text line by line by reading the PDF file directly:

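    library(pdftools)   # pdf_text()
    library(pagedown)   # chrome_print()
    
    # Save the web page as a PDF (requires Chrome or Chromium).
    chrome_print(input = "https://en.wikipedia.org/wiki/Cat", 
                 output = "D:\\Text_PDF_Cat.pdf")
    
    # pdf_text() returns one string per page; split each page on newlines
    # and flatten the result into a single character vector of lines.
    text <- pdf_text("D:\\Text_PDF_Cat.pdf")
    text <- unlist(strsplit(text, "\n", fixed = TRUE))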
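
  • If you also need to know which page each line came from (often useful when building a database), a minimal sketch along the same lines is shown here. It uses base R only. The `pages` vector is a hypothetical stand-in for what pdf_text() returns (one string per page), and the names `lines_by_page` and `lines_df` are illustrative, not part of pdftools.

```r
# Stand-in for the output of pdftools::pdf_text(): one string per page.
# Hypothetical sample text; in practice use pages <- pdf_text("your_file.pdf").
pages <- c("First line\nSecond line\n", "Third line\r\nFourth line")

# Split each page on line breaks, tolerating Windows-style "\r\n" endings.
lines_by_page <- strsplit(pages, "\r?\n")

# One row per line, keeping the number of the page it came from.
lines_df <- data.frame(
  page = rep(seq_along(lines_by_page), lengths(lines_by_page)),
  line = unlist(lines_by_page),
  stringsAsFactors = FALSE
)

# Drop lines that are empty or contain only whitespace.
lines_df <- lines_df[nzchar(trimws(lines_df$line)), ]
```

    Keeping the page number next to each line makes it easier to trace a database record back to its place in the PDF; whether you need that depends on your text-mining step.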