
When reading in PDF text using readtext, is there a way to ensure that readtext respects columns?


The problem is that I have a PDF document formatted in landscape with three columns of text, which I am attempting to read into R using readtext(). When it reads the text in, rather than reading down each column in order, it reads across the columns along the same line of text.

To describe it simply: if the first line of each column were just a string of numbers from 1-10 and the second line a string from 11-20, then readtext() reads it in as "1234567891012345678910" (the first line of one column followed by the first line of the next) rather than as "1234567891011121314..." etc.

Is there a way to specify that readtext() follows columns in my importing process?

Best, Daniel


Solution

  • The (current) answer is no. readtext uses the pdftools package to read PDFs, and pdftools does not recognize the separate columns. This comes from poppler, the library that pdftools uses to read PDFs. See also issue 4 on GitHub. The column information is partly available through pdf_data, which returns each word together with its position on the page, but it is not easy to retrieve.
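
As a rough illustration of what "retrieving it from pdf_data" could look like, here is a minimal sketch. It assumes the columns are of equal width, that words on the same line share the same y value, and that the page width can be approximated by the right edge of the rightmost word. The helper name read_by_column and its arguments are hypothetical and not part of readtext or pdftools; the input is a data frame shaped like one page of pdftools::pdf_data() output (columns x, y, width, text).

```r
# Hypothetical helper (not part of readtext or pdftools): rebuild a
# column-by-column reading order from per-word positions.
# `words`: data frame with columns x, y, width, text, shaped like one
#          page of pdftools::pdf_data() output.
# `n_cols`: number of text columns on the page (assumed equal width).
# `page_width`: assumed page width; defaults to the right edge of the
#               rightmost word, which is only an approximation.
read_by_column <- function(words, n_cols = 3,
                           page_width = max(words$x + words$width)) {
  # Assign each word to a column by the position of its left edge.
  words$col <- pmin(floor(words$x / (page_width / n_cols)) + 1, n_cols)
  # Sort by column, then top to bottom, then left to right.
  words <- words[order(words$col, words$y, words$x), ]
  # Assumption: words in the same column with the same y form one line.
  line_key <- paste(words$col, words$y)
  lines <- tapply(words$text,
                  factor(line_key, levels = unique(line_key)),
                  paste, collapse = " ")
  paste(as.vector(lines), collapse = "\n")
}

# Toy two-column page standing in for real pdf_data() output.
demo <- data.frame(
  x     = c(10, 110, 10, 110, 40),
  y     = c(10, 10, 20, 20, 10),
  width = c(20, 20, 20, 20, 20),
  text  = c("a1", "b1", "a2", "b2", "a1b"),
  stringsAsFactors = FALSE
)
cat(read_by_column(demo, n_cols = 2), "\n")

# With a real file (path is a placeholder), one page could be passed as:
# pages <- pdftools::pdf_data("my_file.pdf")
# txt   <- read_by_column(pages[[1]], n_cols = 3)
```

Real PDFs often break these assumptions (uneven columns, headers spanning the page, slightly different y values within a line), so the column boundaries and the line grouping would likely need adjusting per document.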