Read specific region from PDF

I'm trying to read a specific region on a PDF file. How to do it?

I've tried:

Using PyPDF2, cropped the PDF page and read only that. It doesn't work because PyPDF2's cropbox only shrinks the "view", but keeps all the items outside the specified cropbox. So on reading the cropped pdf text with extract_text(), it reads all the "invisible" contents, not only the cropped part.
Converting the PDF page to PNG, cropping it and using Pytesseract to read the PNG. Py tesseract doesn't work properly, don't know why.

Solution

PyMuPDF can probably do this.

I just answered another question regarding getting the "highlighted text" from a page, but the solution uses the same relevant parts of the PyMuPDF API you want:

figure out a rectangle that defines the area of interest
extract text based on that rectangle

and I say "probably" because I haven't actually tried it on your PDF, so I cannot say for certain that the text is amenable to this process.

import os.path

import fitz
from fitz import Document, Page, Rect


# For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
VISUALIZE = True

input_path = "test.pdf"
doc: Document = fitz.open(input_path)

for i in range(len(doc)):
    page: Page = doc[i]
    page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages

    # Hard-code the rect you need
    rect = Rect(0, 0, 100, 100)

    if VISUALIZE:
        # Draw a red box to visualize the rect's area (text)
        page.draw_rect(rect, width=1.5, color=(1, 0, 0))

    text = page.get_textbox(rect)

    print(text)


if VISUALIZE:
    head, tail = os.path.split(input_path)
    viz_name = os.path.join(head, "viz_" + tail)
    doc.save(viz_name)

For context, here's the project I just finished where this was working for the highlighted text, https://github.com/zacharysyoung/extract_highlighted_text.