I'm trying to read a specific region of a PDF file. How can I do it?
I've tried:
PyMuPDF can probably do this. I say "probably" because I haven't actually tried it on your PDF, so I can't say for certain that its text is amenable to this process.
I just answered another question about getting the "highlighted text" from a page, and that solution uses the same parts of the PyMuPDF API that you need:
import os.path

import fitz
from fitz import Document, Page, Rect

# For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
VISUALIZE = True

input_path = "test.pdf"
doc: Document = fitz.open(input_path)

for i in range(len(doc)):
    page: Page = doc[i]
    page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages

    # Hard-code the rect you need
    rect = Rect(0, 0, 100, 100)

    if VISUALIZE:
        # Draw a red box to visualize the rect's area (text)
        page.draw_rect(rect, width=1.5, color=(1, 0, 0))

    text = page.get_textbox(rect)
    print(text)

if VISUALIZE:
    head, tail = os.path.split(input_path)
    viz_name = os.path.join(head, "viz_" + tail)
    doc.save(viz_name)
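The hard-coded Rect(0, 0, 100, 100) above is in PyMuPDF's units: points (1/72 inch), measured from the top-left corner of the page with y increasing downward. If you measured your region in inches instead (say, with a ruler tool in a PDF viewer), a small helper like the one below can do the conversion. This is only a sketch; `region_in_points` and its parameter names are mine, not part of PyMuPDF.

```python
POINTS_PER_INCH = 72  # PDF unit: 1 point = 1/72 inch


def region_in_points(left_in, top_in, width_in, height_in):
    """Convert a region given in inches from the page's top-left corner
    into (x0, y0, x1, y1) in points.

    Hypothetical helper for illustration; not a PyMuPDF function.
    """
    x0 = left_in * POINTS_PER_INCH
    y0 = top_in * POINTS_PER_INCH
    x1 = (left_in + width_in) * POINTS_PER_INCH
    y1 = (top_in + height_in) * POINTS_PER_INCH
    return (x0, y0, x1, y1)


# A region 2 in wide and 1 in tall, starting 1 in from the left edge
# and 0.5 in from the top edge of the page
coords = region_in_points(1, 0.5, 2, 1)
print(coords)
```

The resulting tuple can be unpacked into the rect, i.e. Rect(*coords), and the VISUALIZE option will show you whether the red box lands where you expected.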
For context, here's the project I just finished, where this approach was working for highlighted text: https://github.com/zacharysyoung/extract_highlighted_text.
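If get_textbox gives you more or less text than you expect for your PDF, one alternative is to ask for the individual words with page.get_text("words") and filter them yourself. Each item it returns is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). Below is a minimal sketch of that filtering. The `words_in_region` helper and the sample data are made up for illustration, and "fully inside the region" is just one possible rule; I haven't run this against your file.

```python
def words_in_region(words, region):
    """Return the words whose boxes lie entirely inside region, in input order.

    `words` uses the tuple layout of page.get_text("words");
    `region` is (x0, y0, x1, y1) in points. Hypothetical helper.
    """
    rx0, ry0, rx1, ry1 = region
    kept = []
    for x0, y0, x1, y1, word, *_ in words:
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            kept.append(word)
    return kept


# Made-up sample data standing in for page.get_text("words")
sample_words = [
    (10, 10, 40, 22, "Invoice", 0, 0, 0),
    (45, 10, 80, 22, "No.", 0, 0, 1),
    (90, 10, 130, 22, "12345", 0, 0, 2),    # sticks out past x = 100
    (10, 150, 60, 162, "Footer", 1, 0, 0),  # below the region
]

region_text = " ".join(words_in_region(sample_words, (0, 0, 100, 100)))
print(region_text)  # Invoice No.
```

With a real page you would pass page.get_text("words") instead of sample_words. Loosening the test to "overlaps the region" instead of "fully inside" is an easy change if words on the boundary matter to you.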