python pdf pypdf pdfminer

Python read pdf in sections


I'm trying to read a PDF file where each page is divided into a 3x3 grid of information blocks of the form

A | B | C
D | E | F
G | H | I

Each entry spans multiple lines. A simplified example of one entry is this card; the other 8 slots would hold similar cards. I'd like to read A, then B, then C, and so on; however, I could get by if I read the first line of A, B, and C, then the second line of A, B, and C, etc. I've looked at pdfminer and pypdf, but I haven't found anything that fits what I'm looking for. The answer here works fairly well, but the order of columns routinely gets distorted.


Solution

  • In the second answer here, replace

    self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2]))
    

    by

    self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2], x[1]))
    

    Very important: See the last paragraph of this answer.
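
To see why the extra sort key helps, here is a minimal sketch on made-up data. It assumes each row has the layout (page number, left x, bottom y, text), so that x[0] is the page, x[1] the horizontal position, and x[2] the vertical position; that layout is an assumption about the linked answer, and the coordinates and text below are hypothetical.

```python
# Hypothetical rows, assumed layout: (page_number, x0, y0, text).
# PDF coordinates start at the bottom-left, so a larger y0 is higher on the page.
rows = [
    (1, 400.0, 700.0, "C line 1"),
    (1, 50.0, 700.0, "A line 1"),
    (1, 225.0, 700.0, "B line 1"),
    (1, 225.0, 685.0, "B line 2"),
    (1, 50.0, 685.0, "A line 2"),
    (1, 400.0, 685.0, "C line 2"),
]

# Original key: page ascending, then top to bottom.
# Rows at the same height keep whatever order they arrived in.
old_order = [r[3] for r in sorted(rows, key=lambda x: (x[0], -x[2]))]

# Modified key: the added x[1] breaks ties left to right.
new_order = [r[3] for r in sorted(rows, key=lambda x: (x[0], -x[2], x[1]))]

print(old_order)
print(new_order)
```

With the original key, the three "line 1" rows come out in their input order (C, A, B); with the modified key they come out as A, B, C. Note that the tie-breaker only applies when the vertical positions are exactly equal, so if the real coordinates differ slightly within one visual line, rounding x[2] before sorting may be needed.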