I've found the bbox coordinates in the lxml file and managed to extract the data I want with PDFQuery. I then write the data to a CSV file.
import pandas as pd
import pdfquery
from pathlib import Path

def pdf_scrape(pdf):
    """
    Extract each relevant piece of information individually.
    input: pdf to be scraped
    returns: dataframe of scraped data
    """
    # Define coordinates of text to be extracted
    CUSTOMER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 563.285, 624.656, 580.888")').text()
    CUSTOMER_REF = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 534.939, 443.186, 552.542")').text()
    SALES_ORDER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 504.692, 414.352, 522.295")').text()
    ITEM_NUMBER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 478.246, 395.129, 495.849")').text()
    KEY = '0000' + SALES_ORDER + '-' + '00' + ITEM_NUMBER

    # Combine all relevant information into a single pandas dataframe
    page = pd.DataFrame({
        'KEY': KEY,
        'CUSTOMER': CUSTOMER,
        'CUSTOMER REF.': CUSTOMER_REF,
        'SALES ORDER': SALES_ORDER,
        'ITEM NUMBER': ITEM_NUMBER
    }, index=[0])
    return page

pdf_search = Path("files/").glob("*.pdf")
pdf_files = [str(file.absolute()) for file in pdf_search]

master = list()
for pdf_file in pdf_files:
    pdf = pdfquery.PDFQuery(pdf_file)
    pdf.load(0)
    # Scrape the first page of each document and collect the results
    page = pdf_scrape(pdf)
    master.append(page)

master = pd.concat(master, ignore_index=True)
master.to_csv('scraped_PDF_as_csv/scraped_PDF_DataFrame.csv', index=False)
The problem is that I need to read through hundreds of PDFs each day, and this script takes ~13-14 seconds to mine four elements from the first page of only 10 PDFs.
Is there a way to speed up my code? I've looked at this: https://github.com/py-pdf/benchmarks which implies that PDFQuery is very slow compared to other libraries.
I've tried using PyMuPDF as it's supposed to be faster, but I'm having trouble implementing it to give the same output as PDFQuery. Does anyone know how to do this?
To reiterate, I know where in the document the desired text is, but I don't necessarily know what it says.
I've explored PyMuPDF a little while answering other questions here on SO, but I have no personal/practical experience with it. I knew nothing of PDFQuery before this post. Still, I can show my take on a very basic sample of getting a single piece of text based on location with PyMuPDF.
Also, you don't need to infer from those timings that PDFQuery is slow; the author points this out multiple times in the docs:
Performance Note: The initial call to pdf.load() runs very slowly, because the underlying pdfminer library has to compare every element on the page to every other element. See the Caching section to avoid this on subsequent runs.
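If you'll be re-reading the same PDFs, that Caching section amounts to passing a parse-tree cacher when you construct the PDFQuery object. A minimal sketch, based on the PDFQuery README (the "/tmp/" directory is just an example location for the cache):

import pdfquery
from pdfquery.cache import FileCache

# Cache the parsed layout tree on disk so later loads of the same file are fast
pdf = pdfquery.PDFQuery("simple.pdf", parse_tree_cacher=FileCache("/tmp/"))
pdf.load(0)

Note that caching only pays off on repeated reads of the same file; if every day's PDFs are new, the first load still pays the full price. Anyway, here's my basic PDFQuery sample: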
import pdfquery
query1 = (176.4, 629.28, 176.4, 629.28) # "Text 1" in simple.pdf
pdf = pdfquery.PDFQuery("simple.pdf")
# query1 = (130, 407, 130, 407) # Looking for "Gaussian" in more_complicated.pdf
# pdf = pdfquery.PDFQuery("more_complicated.pdf")
pdf.load(0)
text1 = pdf.pq('LTTextLineHorizontal:overlaps_bbox("%d, %d, %d, %d")' % query1).text()
print(text1)
I'm still not sure how to best approach this task with PyMuPDF, but here's a way that at least gives me the target texts for both simple and complicated:
from fitz import open as fitz_open, Document, Page, Rect

query1 = Rect(165.6, 165.6, 165.6, 165.6)  # "Text 1" in simple.pdf
doc: Document = fitz_open("simple.pdf")
# query1 = Rect(130, 381, 130, 381)  # Looking for "Gaussian" in more_complicated.pdf
# doc: Document = fitz_open("more_complicated.pdf")

page: Page = doc.load_page(0)
page_dict: dict = page.get_text("dict")

bbox: Rect  # a variable we'll reuse as we work down to our query
text1 = ""  # the text we're looking for with query1

block: dict
for block in page_dict["blocks"]:
    if block["type"] == 1:  # skip, it's an image
        continue
    bbox = Rect(block["bbox"])
    if not bbox.contains(query1):
        continue

    line: dict
    for line in block["lines"]:
        bbox = Rect(line["bbox"])
        if not bbox.contains(query1):
            continue

        span: dict
        for span in line["spans"]:
            bbox = Rect(span["bbox"])
            if not bbox.contains(query1):
                continue
            text1 = span["text"]

print(text1)
(You might have noticed that the query coordinates are different between PDFQuery and PyMuPDF, and that's because PDFQuery uses the bottom-left as the origin, and PyMuPDF uses the upper-left as the origin.)
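If you need to translate a bbox between the two systems, it's just a flip of the y-values against the page height. A small helper to that effect (the function name is mine, not part of either library):

from fitz import Rect

def pdfquery_bbox_to_rect(x0, y0, x1, y1, page_height):
    # PDFQuery: origin bottom-left, y grows upward
    # PyMuPDF:  origin top-left,   y grows downward
    return Rect(x0, page_height - y1, x1, page_height - y0)

# page_height comes from the PyMuPDF page, e.g. page.rect.height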
I also measured the run times with the time command on macOS 12.4, averaged over 3 runs, for both PDFQuery and PyMuPDF against simple.pdf and more_complicated.pdf. PyMuPDF ran both PDFs in almost the same time, and I think we're seeing PDFQuery take longer to make those n**2/2 cross-comparisons.
I think you'd be giving up a lot of convenience trying to do this yourself. If your PDFs are consistent, you could probably tune PyMuPDF and get it just right; but if there's variation in how they were created, it might take longer to get right (if ever, because text in PDFs is deceptively tricky).
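If you do decide to try it, here's a sketch of what your pdf_scrape() might look like with PyMuPDF's Page.get_textbox(), which returns whatever text falls inside a clip rectangle. It reuses your PDFQuery bboxes and flips the y-values against the page height as described above. It's untested against your actual files, and get_textbox clips by character position, so it may not match overlaps_bbox exactly at the edges:

import fitz  # PyMuPDF
import pandas as pd

def pdf_scrape_fitz(page):
    h = page.rect.height  # needed to flip PDFQuery's bottom-left-origin bboxes

    def grab(x0, y0, x1, y1):
        # Convert the PDFQuery bbox to a PyMuPDF Rect and clip the page text to it
        return page.get_textbox(fitz.Rect(x0, h - y1, x1, h - y0)).strip()

    CUSTOMER = grab(356.684, 563.285, 624.656, 580.888)
    CUSTOMER_REF = grab(356.684, 534.939, 443.186, 552.542)
    SALES_ORDER = grab(356.684, 504.692, 414.352, 522.295)
    ITEM_NUMBER = grab(356.684, 478.246, 395.129, 495.849)
    KEY = '0000' + SALES_ORDER + '-' + '00' + ITEM_NUMBER
    return pd.DataFrame({
        'KEY': KEY,
        'CUSTOMER': CUSTOMER,
        'CUSTOMER REF.': CUSTOMER_REF,
        'SALES ORDER': SALES_ORDER,
        'ITEM NUMBER': ITEM_NUMBER,
    }, index=[0])

doc = fitz.open("example.pdf")  # hypothetical filename
print(pdf_scrape_fitz(doc.load_page(0)))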