Search code examples
pythonpython-3.xpypdfpdfminer

Extracting text from specific coordinates of a PDF in python


I have some pre-determined coordinates that I want to look into a PDF to extract text from (some part on the top of the page). I've been trying to use the library pdfminer.six but it seems like the smallest unit for processing and extracting elements is a page.

I was thinking that in order to just get text from a small part of a page, it could get a little inefficient to go through and analyse the entire page when there are large number of documents to process.

Is there any way to do so? Or is there some other library that can work with this use case, where I can pass in coordinates? Or am I getting the concept wrong fundamentally?

Thanks!


Solution

  • You can use visitor functions to do that: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer