python-3.x adobe pdfbox apache-tika pdfminer

Adobe Acrobat/Python PDF Outputs Varying

I've noticed that when I use OCR (in this case Adobe Acrobat Pro) to transform a scanned PDF document into text, I get very different outputs depending on how I extract the data.

In the photo above, you can see a piece of a PDF that has been OCR'ed into fairly good quality text. If I select it in Adobe and copy it to, say, a Word or .txt document, it pastes over perfectly fine.

However, if I export it from Adobe to Rich Text Format, or extract it with Python's PDFMiner or the Python Apache Tika bindings, I get the result in the above photo, which as you can see is completely jumbled. The extraction results are very consistent between the approaches: all three jumble the text in exactly the same way.

Would any of you have any idea why an OCR'd PDF can be copied just fine into a text editor but extracts in such a bizarre way?

Thank you!

Regards, Mano

Solution

What ended up working for me was running the initial parse with Apache Tika and then passing the few documents it didn't work on through PyPDF2. My theory is that PyPDF2 uses a different parsing mechanism that, unlike Tika's, doesn't rely on the root of the PDF, and that the root is what seems to have become corrupted in a few of these OCR'd docs.

Not sure of the initial cause but that was my solution.
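
For anyone who wants to try the same thing, here is a rough sketch of the Tika-first, PyPDF2-fallback approach. It assumes the tika Python package (which needs Java available) and PyPDF2 2.x or later. The function names and the looks_usable check are hypothetical, made up for illustration: I haven't described how I decided a document "didn't work", so the empty-text test below is only a placeholder assumption that you would swap for your own check.

```python
from typing import Callable, Optional


def extract_with_tika(path: str) -> Optional[str]:
    """Extract text through the tika Python package (requires Java)."""
    from tika import parser  # imported lazily so this module loads without tika

    parsed = parser.from_file(path)
    return parsed.get("content")  # may be None when Tika finds no text


def extract_with_pypdf2(path: str) -> str:
    """Extract text page by page with PyPDF2 (2.x or later API)."""
    from PyPDF2 import PdfReader  # imported lazily, same reason as above

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def looks_usable(text: Optional[str]) -> bool:
    """Hypothetical placeholder: missing or whitespace-only text is a failure.

    ASSUMPTION: replace this with whatever tells you a parse went wrong.
    """
    return bool(text and text.strip())


def extract_text(
    path: str,
    primary: Callable[[str], Optional[str]] = extract_with_tika,
    fallback: Callable[[str], Optional[str]] = extract_with_pypdf2,
    is_ok: Callable[[Optional[str]], bool] = looks_usable,
) -> str:
    """Try the primary extractor first; use the fallback if it fails."""
    try:
        text = primary(path)
    except Exception:
        # Treat any parser error as "didn't work" and fall through.
        text = None
    if is_ok(text):
        return text
    return fallback(path) or ""
```

Calling extract_text("scan.pdf") on each file runs Tika and only touches PyPDF2 for the documents where the check fails. The extractors are passed in as arguments so you can swap in PDFMiner or a stricter check without changing the rest.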