Tags: python, pdf, pdfminer, pymupdf

Extracting text in known bbox from pdf, PDFQuery too slow


I've found the bbox coordinates in the lxml file and managed to extract the data I want with PDFQuery. I then write the data to a CSV file.

import pandas as pd
import pdfquery
from pathlib import Path

def pdf_scrape(pdf):
    """
    Extract each relevant information individually
    input: pdf to be scraped
    returns: dataframe of scraped data
    """
    # Define coordinates of text to be extracted
    CUSTOMER             = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 563.285, 624.656, 580.888")').text() 
    CUSTOMER_REF         = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 534.939, 443.186, 552.542")').text()
    SALES_ORDER          = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 504.692, 414.352, 522.295")').text()
    ITEM_NUMBER          = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 478.246, 395.129, 495.849")').text()
    KEY                  = '0000' + SALES_ORDER + '-' + '00' + ITEM_NUMBER
    # Combine all relevant information into a single pandas dataframe
    page = pd.DataFrame({
        'KEY'          : KEY,
        'CUSTOMER'     : CUSTOMER,
        'CUSTOMER REF.': CUSTOMER_REF,
        'SALES ORDER'  : SALES_ORDER,
        'ITEM NUMBER'  : ITEM_NUMBER
                       }, index=[0])
    return page

pdf_search = Path("files/").glob("*.pdf")

pdf_files = [str(file.absolute()) for file in pdf_search]

master = list()
for pdf_file in pdf_files: 
    pdf = pdfquery.PDFQuery(pdf_file)
    pdf.load(0)

    # Scrape the first page of each document and collect the results
    page = pdf_scrape(pdf)
    master.append(page)

master = pd.concat(master, ignore_index=True)
master.to_csv(r'scraped_PDF_as_csv\scraped_PDF_DataFrame.csv', index=False)

The problem is that I need to read through hundreds of PDFs each day, and this script takes ~13-14 seconds to mine four elements from the first page of only 10 PDFs.

Is there a way to speed up my code? I've looked at this: https://github.com/py-pdf/benchmarks, which implies that PDFQuery is very slow compared to other libraries.

I've tried using PyMuPDF as it's supposed to be faster, but I'm having trouble implementing it to give the same output as PDFQuery. Does anyone know how to do this?

To reiterate, I know where in the document the desired text is, but I don't necessarily know what it says.


Solution

  • I've explored PyMuPDF a little while answering other questions here on SO, but I have no personal/practical experience with it. I knew nothing of PDFQuery before this post. Still, I can show my take on a very basic sample of getting a single piece of text by location with PyMuPDF.

    Also, you don't need to infer from those benchmarks that PDFQuery is slow; the author points this out multiple times in the docs:

    Performance Note: The initial call to pdf.load() runs very slowly, because the underlying pdfminer library has to compare every element on the page to every other element. See the Caching section to avoid this on subsequent runs.
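
    For completeness, the caching the docs point to looks roughly like this; it's a sketch based on the PDFQuery README (the cache directory is an arbitrary choice), and it only helps on repeated runs against the same file:

    import pdfquery
    from pdfquery.cache import FileCache

    # Persist the parsed element tree on disk so subsequent loads of the
    # same PDF skip the expensive pdfminer cross-comparison step.
    pdf = pdfquery.PDFQuery("simple.pdf", parse_tree_cacher=FileCache("/tmp/"))
    pdf.load(0)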

    PDFQuery

    import pdfquery
    
    query1 = (176.4, 629.28, 176.4, 629.28)  # "Text 1" in simple.pdf
    pdf = pdfquery.PDFQuery("simple.pdf")
    
    # query1 = (130, 407, 130, 407)  # Looking for "Gaussian" in more_complicated.pdf
    # pdf = pdfquery.PDFQuery("more_complicated.pdf")
    
    pdf.load(0)
    
    text1 = pdf.pq('LTTextLineHorizontal:overlaps_bbox("%d, %d, %d, %d")' % query1).text()
    
    print(text1)
    

    PyMuPDF

    I'm still not sure how to best approach this task with PyMuPDF, but here's a way that at least gives me the target texts for both simple and complicated:

    from fitz import open as fitz_open, Document, Page, Rect
    
    query1 = Rect(165.6, 165.6, 165.6, 165.6)  # "Text 1" in simple.pdf
    doc: Document = fitz_open("simple.pdf")
    
    # query1 = Rect(130, 381, 130, 381)  # Looking for "Gaussian" in more_complicated.pdf
    # doc: Document = fitz_open("more_complicated.pdf")
    
    page: Page = doc.load_page(0)
    
    page_dict: dict = page.get_text("dict")
    
    bbox: Rect  # a variable we'll reuse as we work down to our query
    text1 = ""  # the text we're looking for with query1
    
    block: dict
    for block in page_dict["blocks"]:
        if block["type"] == 1:  # skip, it's an image
            continue
    
        bbox = Rect(block["bbox"])
        if not bbox.contains(query1):
            continue
    
        line: dict
        for line in block["lines"]:
    
            bbox = Rect(line["bbox"])
            if not bbox.contains(query1):
                continue
    
            span: dict
            for span in line["spans"]:
    
                bbox = Rect(span["bbox"])
                if not bbox.contains(query1):
                    continue
    
                text1 = span["text"]
    
    print(text1)
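
    Depending on your PyMuPDF version, there may also be a shorter route: Page.get_textbox() clips extraction to a rectangle directly. A minimal sketch, assuming the method is available in your version and using a placeholder rectangle drawn around the sample query point above:

    from fitz import open as fitz_open, Document, Page, Rect

    doc: Document = fitz_open("simple.pdf")
    page: Page = doc.load_page(0)

    # Placeholder rectangle (top-left origin) around the query point used above;
    # widen or narrow it to match where your text actually sits.
    clip = Rect(140, 155, 220, 175)

    # Returns the text whose characters fall inside the clip rectangle.
    text1 = page.get_textbox(clip).strip()
    print(text1)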
    

    Analysis

    (You might have noticed that the query coordinates are different between PDFQuery and PyMuPDF, and that's because PDFQuery uses the bottom-left as the origin, and PyMuPDF uses the upper-left as the origin.)
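
    If you want to reuse the bbox coordinates you already pulled out of the lxml tree, the conversion is just a flip of the y-axis using the page height. A sketch, assuming an unrotated page and using one of the bounding boxes from the question as an example (the file name is a placeholder):

    from fitz import open as fitz_open, Rect

    doc = fitz_open("your_document.pdf")  # substitute one of your PDFs
    page = doc.load_page(0)

    # PDFQuery/pdfminer bbox: (x0, y0, x1, y1), origin at the bottom-left.
    x0, y0, x1, y1 = (356.684, 563.285, 624.656, 580.888)

    # PyMuPDF measures y downward from the top-left, so flip the y values:
    # the old top edge (y1) becomes the new top, the old bottom (y0) the new bottom.
    height = page.rect.height
    fitz_rect = Rect(x0, height - y1, x1, height - y0)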

    I also measured the run times with the time command on macOS 12.4; average of 3 runs. Here are my results for running both PDFQuery and PyMuPDF against simple.pdf and more_complicated.pdf:

    file                   simple.pdf   more_complicated.pdf
    PDFQuery timing (s)    0.123        0.258
    PyMuPDF timing (s)     0.069        0.070

    PyMuPDF runs both PDFs in almost the same time, and I think we're seeing PDFQuery taking longer to make those n**2/2 cross-comparisons.

    I think you'll be giving up a lot of convenience by trying to do this yourself. If your PDFs are consistent, you could probably tune PyMuPDF and get it just right, but if there's variation in how they were created, it might take longer to get right (if ever, because text in PDFs is deceptively tricky).