Tags: python, pdf, pypdf, pdfminer

Python read part of a pdf page


I'm trying to read a pdf file where each page is divided into 3x3 blocks of information of the form

A | B | C
D | E | F
G | H | I

Each of the entries is broken into multiple lines. A simplified example of one entry is this card; the other 8 slots would hold similar entries.

I've looked at pdfminer and pypdf2. I haven't found pdfminer overly useful, but pypdf2 has given me something close.

import PyPDF2

def getPDFContent(path):
    content = ""
    # open() replaces the Python 2-only file() builtin
    with open(path, "rb") as p:
        # Legacy PyPDF2 (< 3.0) API
        pdf = PyPDF2.PdfFileReader(p)
        numPages = pdf.getNumPages()
        for i in range(numPages):
            content += pdf.getPage(i).extractText() + "\n"
    # Collapse all whitespace (including non-breaking spaces) to single spaces
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

However, this only reads the file line by line. I'd like a solution where I can read only a portion of the page, so that I could read A, then B, then C, and so on. The answer here also works fairly well, but the order of columns routinely gets distorted, and I've only gotten it to read line by line.


Solution

  • I assume the PDF files in question are generated PDFs rather than scanned (as in the example you gave), given that you're using pdfminer and pypdf2. If you know the size of the columns and rows in inches, you can use minecart (full disclosure: I wrote minecart). Example code:

    import minecart
    
    # minecart units are 1/72 inch, measured from bottom-left of the page
    ROW_BORDERS = (
        72 * 7,  # Top row starts 7 inches from the bottom of the page
        72 * 5,  # Middle row starts 5 inches from the bottom of the page
        72 * 3,  # Bottom row starts 3 inches from the bottom of the page
        72 * 1,  # Bottom row ends 1 inch from the bottom of the page
    )  # listed top to bottom so that BOXES is ordered properly
    COLUMN_BORDERS = (
        72 * 2,  # First col starts 2 inches from the left of the page
        72 * 4,  # Second col starts 4 inches from the left of the page
        72 * 6,  # Third col starts 6 inches from the left of the page
        72 * 8,  # Third col ends 8 inches from the left of the page
    )
    # One (left, bot, right, top) box per cell, in the order A, B, C, ..., I
    BOXES = [
        (left, bot, right, top)
        for top, bot in zip(ROW_BORDERS, ROW_BORDERS[1:])
        for left, right in zip(COLUMN_BORDERS, COLUMN_BORDERS[1:])
    ]
    
    def extract_output(page):
        """
        Reads the text from page and splits it into the 9 cells.
    
        Returns a list with 9 entries: 
    
            [A, B, C, D, E, F, G, H, I]
    
        Each item in the list is a string with all of the
        text found in the cell.
    
        """
        res = []
        for box in BOXES:
            strings = list(page.letterings.iter_in_bbox(box))
            # We sort from top-to-bottom and then from left-to-right, based
            # on the strings' top left corner
            strings.sort(key=lambda x: (-x.bbox[3], x.bbox[0]))
            res.append(" ".join(strings).replace(u"\xa0", " ").strip())
        return res
    
    content = []
    # Keep the file open while pages are being read, then close it
    with open("path/to/pdf-doc.pdf", "rb") as pdf_file:
        doc = minecart.Document(pdf_file)
        for page in doc.iter_pages():
            content.append(extract_output(page))
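
If the cells are all the same size, the boxes could also be computed from the outer edges of the grid instead of listing every border by hand. This is only a minimal sketch under that assumption; `grid_boxes` and `labeled` are hypothetical names used for illustration and are not part of minecart.

```python
def grid_boxes(left, bottom, right, top, rows=3, cols=3):
    """Split a rectangle into rows x cols equal boxes in reading order.

    Coordinates are in PDF points (1/72 inch), measured from the
    bottom-left corner of the page. Each box is (left, bot, right, top).
    """
    cell_w = (right - left) / cols
    cell_h = (top - bottom) / rows
    boxes = []
    for r in range(rows):  # r = 0 is the top row
        box_top = top - r * cell_h
        box_bot = box_top - cell_h
        for c in range(cols):  # c = 0 is the leftmost column
            box_left = left + c * cell_w
            boxes.append((box_left, box_bot, box_left + cell_w, box_top))
    return boxes


# Assumed grid: columns span 2 to 8 inches from the left edge,
# rows span 1 to 7 inches from the bottom edge.
boxes = grid_boxes(72 * 2, 72 * 1, 72 * 8, 72 * 7)

# Label the cells A..I in reading order (top-left to bottom-right).
labeled = dict(zip("ABCDEFGHI", boxes))
```

Each tuple has the same (left, bot, right, top) layout as the entries in BOXES, so it could be passed to `page.letterings.iter_in_bbox` in the same way.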