python, pdf, ocr, python-tesseract, pypdf

Read specific region from PDF


I'm trying to read the text in a specific region of a PDF page. How can I do it?

I've tried:

  1. Cropping the PDF page with PyPDF2 and reading only the cropped page. This doesn't work because PyPDF2's cropbox only shrinks the visible area and keeps every item outside it. As a result, extract_text() on the cropped page still returns all the "invisible" content, not just the cropped part.
  2. Converting the PDF page to PNG, cropping the image, and reading it with Pytesseract. Pytesseract doesn't work properly, and I don't know why.

Solution

  • PyMuPDF can probably do this.

    I just answered another question regarding getting the "highlighted text" from a page, but the solution uses the same relevant parts of the PyMuPDF API you want:

    • figure out a rectangle that defines the area of interest
    • extract text based on that rectangle

    I say "probably" because I haven't actually tried it on your PDF, so I can't be certain that its text is amenable to this process.

    import os.path
    
    import fitz
    from fitz import Document, Page, Rect
    
    
    # For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
    VISUALIZE = True
    
    input_path = "test.pdf"
    doc: Document = fitz.open(input_path)
    
    for i in range(len(doc)):
        page: Page = doc[i]
        page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages
    
        # Hard-code the rect you need: (x0, y0, x1, y1) in points, measured from the page's top-left corner
        rect = Rect(0, 0, 100, 100)
    
        if VISUALIZE:
            # Draw a red box to visualize the rect's area (text)
            page.draw_rect(rect, width=1.5, color=(1, 0, 0))
    
        text = page.get_textbox(rect)
    
        print(text)
    
    
    if VISUALIZE:
        head, tail = os.path.split(input_path)
        viz_name = os.path.join(head, "viz_" + tail)
        doc.save(viz_name)
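
    The hard-coded Rect(0, 0, 100, 100) above is measured in points (1/72 inch) from the top-left corner of the page, which is how PyMuPDF lays out its coordinates. If the region you already worked out came from a PyPDF2 cropbox, it is most likely measured from the bottom-left corner instead, so the y values need flipping. Here is a minimal sketch of that conversion, assuming an unrotated page whose lower-left corner sits at (0, 0). The name flip_box_to_top_left is a hypothetical helper of mine, not part of either library:

```python
def flip_box_to_top_left(x0, y0, x1, y1, page_height):
    """Convert a bottom-left-origin box (PDF default) to a top-left-origin box.

    Returns (x0, top, x1, bottom) with top < bottom, suitable for passing
    to fitz.Rect(...). Assumes an unrotated page with its origin at (0, 0).
    """
    left, right = min(x0, x1), max(x0, x1)
    low, high = min(y0, y1), max(y0, y1)
    # The edge that was highest above the bottom is now closest to the top.
    return (left, page_height - high, right, page_height - low)


# A 200 x 50 pt box whose lower-left corner is at (72, 700)
# on a US Letter page (792 pt tall).
print(flip_box_to_top_left(72, 700, 272, 750, page_height=792))
# -> (72, 42, 272, 92)
```

    Inside the loop, page.rect.height gives the page height, so the result could be unpacked straight into Rect(*flip_box_to_top_left(...)). The VISUALIZE box is a quick way to confirm the converted rect lands where you expect.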
    

    For context, here's the project I just finished, where this approach worked for highlighted text: https://github.com/zacharysyoung/extract_highlighted_text
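
    If get_textbox doesn't give you quite what you want, one option is to pull the word boxes with page.get_text("words") and filter them yourself. Each entry is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). The sketch below keeps only the words that sit completely inside the rectangle, which is one possible rule among several (you might prefer "overlaps" instead). The name words_inside is hypothetical and the sample data is made up, so the snippet runs without a PDF:

```python
def words_inside(words, rect):
    """Return the text of words whose boxes lie fully inside rect.

    words: tuples shaped like PyMuPDF's page.get_text("words") output,
           (x0, y0, x1, y1, word, block_no, line_no, word_no).
    rect:  (x0, y0, x1, y1) in the same top-left-origin coordinates.
    """
    rx0, ry0, rx1, ry1 = rect
    kept = [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]
    # Order by the block, line, and word numbers PyMuPDF assigned.
    kept.sort(key=lambda w: (w[5], w[6], w[7]))

    # Group words that share a (block, line) pair onto one output line.
    lines = {}
    for w in kept:
        lines.setdefault((w[5], w[6]), []).append(w[4])
    return "\n".join(" ".join(parts) for parts in lines.values())


# Made-up sample data standing in for page.get_text("words").
sample = [
    (10, 10, 40, 22, "Invoice", 0, 0, 0),
    (45, 10, 70, 22, "No.", 0, 0, 1),
    (10, 30, 60, 42, "12345", 0, 1, 0),
    (300, 500, 360, 512, "Footer", 1, 0, 0),  # outside the rect
]
print(words_inside(sample, (0, 0, 100, 100)))
```

    With a real page, you would pass page.get_text("words") as the first argument and the four coordinates of your rect as the second.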