
Obtaining data from a PDF file with the same layout as with a copy+paste


I have a procedure I'm looking to automate which involves getting a series of tables from a PDF file. Currently I can do so by opening the file in any viewer (Adobe, Sumatra, Okular, etc.) and just pasting it into Notepad with Ctrl+A, Ctrl+C, Ctrl+V. This keeps each line aligned in a reasonable enough format that I can then run a regex and copy and paste the result into Excel for whatever is needed afterwards.

When trying to do this with Python I tried various modules, mainly PDFMiner, which sort of works (using this example, for instance) but returns the data in a single column. Other options include extracting it as an HTML table, but that adds extra splits mid-table, which makes the parsing more complicated, and it occasionally even switches columns around between the first and second pages.

I've gotten a temporary solution working for now, but I'm worried I'm reinventing the wheel: I'm probably either missing a core option in the parser, or overlooking something fundamental about how the PDF renderer works that would solve this.

Any ideas on how to approach this?


Solution

  • I ended up implementing a solution based on this one, which is itself adapted from code by tgray. It has worked consistently in all of the cases I've tested so far, but I have yet to work out how to set pdfminer's parameters directly to obtain the desired behaviour.
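
The linked code is not reproduced here, so the following is only a rough sketch of the general idea behind a copy+paste-like layout, not the actual solution: take each text fragment with its page coordinates, group fragments that share roughly the same vertical position into one line, and pad with spaces so the horizontal position maps to a character column. The names render_layout and fragments_from_pdf, the (x, y, text) tuple format, and the char_width and line_tolerance defaults are all hypothetical and would need tuning per document. The optional helper assumes pdfminer.six is installed.

```python
def render_layout(fragments, char_width=6.0, line_tolerance=3.0):
    """Place (x, y, text) fragments on a character grid.

    x and y are PDF points with the origin at the bottom-left of the page.
    char_width and line_tolerance are guesses, not pdfminer settings.
    """
    rows = []  # each row is [reference_y, [(x, text), ...]]
    # Top of the page first (largest y), then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= line_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])

    lines = []
    for _, cells in rows:
        line = ""
        for x, text in sorted(cells):
            column = int(x / char_width)
            # Never overwrite earlier text: keep at least one space between cells.
            if line and column <= len(line):
                column = len(line) + 1
            line = line.ljust(column) + text
        lines.append(line.rstrip())
    return "\n".join(lines)


def fragments_from_pdf(path):
    """Collect (x0, y0, text) for every text line (assumes pdfminer.six)."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    pages = []
    for page in extract_pages(path):
        fragments = []
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        x0, y0, _, _ = line.bbox
                        fragments.append((x0, y0, line.get_text().strip()))
        pages.append(fragments)
    return pages


if __name__ == "__main__":
    sample = [(72, 700, "Name"), (200, 700, "Qty"),
              (72, 686, "Widget"), (200, 685, "3")]
    print(render_layout(sample))
```

For a real file, something like `for frags in fragments_from_pdf("tables.pdf"): print(render_layout(frags))` would print one aligned text block per page ("tables.pdf" is a placeholder). Because columns end up at fixed character offsets, the same regex used on the Notepad output should have a reasonable chance of working here too.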