python pdf text-extraction pypdf pdfminer

Extract PDF text within bounding box directly into Python


I'm trying to extract the text of a PDF within a given bounding rectangle. I understand there are tools for PDF scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all three, and so far I've only gotten pdftotext to extract text from within a given bounding box. That code looks something like this:

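import subprocess

# x, y, w, h stand in for the various inputs into my function.
# Each argument must be its own list item; poppler's pdftotext takes the
# crop size as -W/-H (uppercase), while lowercase -h prints the help text.
cmd = ["pdftotext", "-x", str(x), "-y", str(y), "-W", str(w), "-H", str(h),
       pdf_path, text_out]
subprocess.call(cmd)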

However, this writes its output to a text file. I want to use the text immediately, without having to open a text file to retrieve the words that were in the bounding box: I'll be doing this for 10,000+ documents, and opening that many files might be a pain. I'm basically running a command-line program from my Python script, so I don't think there's a way around that, but I'm not sure. Since pdfminer and pypdf are actual Python packages, I can get their text directly, but they don't appear to offer any way to extract text within given pixel limits.

As a further note: I'm looking to do this in Python specifically, as I have a ton of other code for the same overarching project.


Solution

  • The PyMuPDF (fitz) package works for this. Its maintainers provide a script and documentation at: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction

    Their script builds the rectangle from the words that bound the region. You can instead supply your own rectangle by writing rect = fitz.Rect(x0, y0, x1, y1) in place of their rect = ~their stuff~. In case it's not clear, pno is the page number you're extracting from.
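For illustration, here is a minimal sketch of that idea. It is not the linked script itself. It assumes a PyMuPDF version where page.get_text("words") is available (older releases spelled it differently) and that each word comes back as a tuple starting with (x0, y0, x1, y1, word, ...). The helper names words_in_rect, make_text, and extract_text_in_rect are made up for this example. Coordinates are in PDF points with the origin at the top left, and pno is 0-based.

```python
def words_in_rect(words, rect):
    """Keep the word tuples whose box lies fully inside rect = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = rect
    return [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]


def make_text(words):
    """Join word tuples into lines, grouping words by their rounded bottom edge."""
    lines = {}
    # Sort left to right so the words of each line come out in reading order.
    for w in sorted(words, key=lambda w: w[0]):
        lines.setdefault(round(w[3], 1), []).append(w[4])
    # Emit the lines top to bottom.
    return "\n".join(" ".join(ws) for _, ws in sorted(lines.items()))


def extract_text_in_rect(pdf_path, pno, rect):
    """Return the text inside rect on page pno as a string, with no output file."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    try:
        # Each entry starts with (x0, y0, x1, y1, word, ...).
        words = doc[pno].get_text("words")
    finally:
        doc.close()
    return make_text(words_in_rect(words, rect))
```

A call such as extract_text_in_rect("some.pdf", 0, (50, 100, 300, 200)) (file name and coordinates are placeholders) returns the text as a Python string, so nothing has to be written to disk and reopened for each of the 10,000+ documents.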