Tags: python, pdf, text-extraction, pdfminer, pdf2image

How to extract text boxes from a PDF and convert them to images


I'm trying to crop the text boxes out of a PDF and save each one as an image. This will be very useful for gathering training data for one of my models. Here's a sample PDF: https://github.com/tomasmarcos/tomrep/blob/tomasmarcos-example2delete/example%20-%20Git%20From%20Bottom%20Up.pdf . For example, I would like to get the first text box as an image (JPG or any other format), like this:

[image]

What I have tried so far is the code below, but I'm open to other approaches if you have one. Part I is a modified version of the first answer to "How to extract text and text coordinates from a PDF file?". Part II is my own attempt, which doesn't work so far. I also tried reading the image with PyMuPDF, but that didn't change anything (I won't post that attempt, since the post is long enough already).

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io

# pdf path 
pdf_path ="example - Git From Bottom Up.pdf"

# PART 1: GET LTBOXES COORDINATES IN THE IMAGE
# Open a PDF file.
fp = open(pdf_path, 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)


# here is where i stored the data
boxes_data = []
page_sizes = []

def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {"startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),"endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),"text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)

Part II of the code, which gets the cropped box as an image:

# PART 2: NOW GET PAGE TO IMAGE
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path,size=(firstpage_size["height"],firstpage_size["width"]))[0]
#show first page with the right size (at least the one that pdfminer says)
firstpage_image.show()

#first box data
startX,startY,endX,endY,text = boxes_data[0].values()
# turn image to array
image_array = np.array(firstpage_image)
# get cropped box
box = image_array[startY:endY,startX:endX,:]
convert2pil_image = PIL.Image.fromarray(box)
#show cropped box image
convert2pil_image.show()
# print the box text; it does not match the crop, which means there's an error
print(text)

As you can see, the coordinates of the box do not match the image. Maybe pdf2image is doing something to the image size, but I specified the size explicitly, so I don't know. Any solutions or suggestions are more than welcome. Thanks in advance.


Solution

  • I've checked the coordinates of the first two boxes from the first part of your code, and they more or less fit the text on the page:

    [image]

    But are you aware that the origin in a PDF is in the bottom-left corner? Maybe this is the cause of the problem.

    Unfortunately I didn't manage to test the second part of the code; pdf2image gives me an error.

    But I'm almost sure that PIL.Image has its origin in the top-left corner, unlike PDF. You can convert pdf_Y to pil_Y with the formula:

    pil_Y = page_height - pdf_Y
    

    The page height in your case is 792 pt, and you can get the page height from the script as well.
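
As a minimal sketch of that conversion (the function name `pdf_y_to_top_origin` is made up for illustration, and it assumes the page's lower edge is at y = 0 and that both values are in points):

```python
def pdf_y_to_top_origin(pdf_y, page_height):
    """Convert a y coordinate measured from the bottom edge (PDF convention)
    into one measured from the top edge (PIL convention). Units: points."""
    return page_height - pdf_y


# A box spanning y = 700..720 pt on a 792 pt page. The edges swap roles:
# the larger PDF y becomes the smaller image y.
top = pdf_y_to_top_origin(720, 792)
bottom = pdf_y_to_top_origin(700, 792)
print(top, bottom)  # 72 92
```

Because the edges swap, the flipped pair has to be reordered before it is used as a slice or crop range.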

    Coordinates

    [image]


    Update

    Nevertheless, after a couple of hours spent installing all of the modules (that was the hardest part!), I got your script to work to some extent.

    Basically I was right: the coordinates were inverted, y => h - y, because PIL and PDF put the origin in different corners.

    And there was another thing. pdf2image renders pages at 200 dpi by default (this can be changed with its dpi parameter), whereas PDF measures everything in points (1 pt = 1/72 inch). So if you want to use PDF coordinates on the rendered image, you need to scale them: x => x * 200 / 72.
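
Both corrections can be combined into one small helper. This is only a sketch: the name `pdf_bbox_to_pixel_box` is hypothetical, it assumes the page's lower-left corner is at (0, 0), and it rounds outward (floor for the start edges, ceil for the end edges) so the crop never cuts into the text, which is a choice of this sketch rather than something the script here does.

```python
import math


def pdf_bbox_to_pixel_box(bbox, page_height_pt, dpi=200):
    """Map a PDF bbox (x0, y0, x1, y1) in points, origin bottom-left,
    to a (left, upper, right, lower) pixel box, origin top-left."""
    x0, y0, x1, y1 = bbox
    # 1 pt = 1/72 inch, so pixels = points * dpi / 72.
    left = math.floor(x0 * dpi / 72)
    right = math.ceil(x1 * dpi / 72)
    # Flip vertically: the PDF top edge (y1) becomes the image's upper edge.
    upper = math.floor((page_height_pt - y1) * dpi / 72)
    lower = math.ceil((page_height_pt - y0) * dpi / 72)
    return left, upper, right, lower


box = pdf_bbox_to_pixel_box((72, 700, 540, 720), page_height_pt=792)
print(box)  # (200, 200, 1500, 256)
```

The tuple is in the order that PIL's `Image.crop` expects, so it could be passed straight to `page_image.crop(box)` without going through a NumPy array. If the page is rendered at a different resolution, the same value would have to be passed both to pdf2image and to this helper.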

    Here is the fixed code:

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.layout import LAParams
    from pdfminer.converter import PDFPageAggregator
    import pdfminer
    import os
    import pandas as pd
    import pdf2image
    import numpy as np
    import PIL
    from PIL import Image
    import io
    from pathlib import Path # it's just my favorite way to handle files
    
    # pdf path
    # pdf_path ="test.pdf"
    pdf_path = Path.cwd()/"Git From Bottom Up.pdf"
    
    
    # PART 1: GET LTBOXES COORDINATES IN THE IMAGE ----------------------
    # Open a PDF file.
    fp = open(pdf_path, 'rb')
    
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    
    # Create a PDF document object that stores the document structure.
    # Password for initialization as 2nd parameter
    document = PDFDocument(parser)
    
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    
    # Create a PDF resource manager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    
    # Create a PDF device object.
    device = PDFDevice(rsrcmgr)
    
    # BEGIN LAYOUT ANALYSIS
    # Set parameters for analysis.
    laparams = LAParams()
    
    # Create a PDF page aggregator object.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    
    
    # here is where i stored the data
    boxes_data = []
    page_sizes = []
    
    def parse_obj(lt_objs, verbose = 0):
        # loop over the object list
        for obj in lt_objs:
            # if it's a textbox, print text and location
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                if verbose >0:
                    print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
                data_dict = {
                    "startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),
                    "endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),
                    "text":obj.get_text()}
                boxes_data.append(data_dict)
            # if it's a container, recurse
            elif isinstance(obj, pdfminer.layout.LTFigure):
                parse_obj(obj._objs)
    
    # loop over all pages in the document
    for page in PDFPage.create_pages(document):
        # read the page into a layout object
        interpreter.process_page(page)
        layout = device.get_result()
        # extract text from this object
        parse_obj(layout._objs)
        mediabox = page.mediabox
        mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
        page_sizes.append(mediabox_data)
    
    # PART 2: NOW GET PAGE TO IMAGE -------------------------------------
    firstpage_size = page_sizes[0]
    firstpage_image = pdf2image.convert_from_path(pdf_path)[0] # without 'size=...'
    #show first page with the right size (at least the one that pdfminer says)
    # firstpage_image.show()
    firstpage_image.save("firstpage.png")
    
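    # scale factor from PDF points to pixels:
    # pdf2image renders at 200 dpi by default, and 1 pt = 1/72 inch
    dpi = 200/72
    vertical_shift = 5 # empirical offset in pixels; I don't know why it's needed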
    page_height = int(firstpage_size["height"] * dpi)
    
    # loop through boxes (we'll process only first page for now)
    for i, _ in enumerate(boxes_data):
    
        # current box data
        startX, startY, endX, endY, text = boxes_data[i].values()
    
        # correction PDF --> PIL
        startY = page_height - int(startY * dpi) - vertical_shift
        endY   = page_height - int(endY   * dpi) - vertical_shift
        startX = int(startX * dpi)
        endX   = int(endX   * dpi)
        startY, endY = endY, startY 
    
        # turn image into array
        image_array = np.array(firstpage_image)
        # get cropped box
        box = image_array[startY:endY,startX:endX,:]
        convert2pil_image = PIL.Image.fromarray(box)
        #show cropped box image
        # convert2pil_image.show()
        png = "crop_" + str(i) + ".png"
        convert2pil_image.save(png)
        # print the box text so it can be compared with the saved crop
        print(text)
    
    

    The code is almost the same as yours. I just added the coordinate correction and saved the crops as PNG files instead of showing them.

    Output:

    [image]

    Gi from the bottom up
    
    Wed,  Dec 9
    
    by John Wiegley
    
    In my pursuit to understand Git, it’s been helpful for me to understand it from the bottom
    up — rather than look at it only in terms of its high-level commands. And since Git is so beauti-
    fully simple when viewed this way, I thought others might be interested to read what I’ve found,
    and perhaps avoid the pain I went through nding it.
    
    I used Git version 1.5.4.5 for each of the examples found in this document.
    
    1.  License
    2.  Introduction
    3.  Repository: Directory content tracking
    
    Introducing the blob
    Blobs are stored in trees
    How trees are made
    e beauty of commits
    A commit by any other name…
    Branching and the power of rebase
    4.  e Index: Meet the middle man
    
    Taking the index farther
    5.  To reset, or not to reset
    
    Doing a mixed reset
    Doing a so reset
    Doing a hard reset
    
    6.  Last links in the chain: Stashing and the reog
    7.  Conclusion
    8.  Further reading
    
    2
    3
    5
    6
    7
    8
    10
    12
    15
    20
    22
    24
    24
    24
    25
    27
    30
    31
    

    Of course, the fixed code is just a prototype. Not for sale. :)
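
One limitation of the prototype is that `boxes_data` collects boxes from every page, while the loop crops all of them out of the first page image. Here is a minimal sketch of keeping a page index with each box so that crops can be matched to the right page. The `page` key and the helper name `group_boxes_by_page` are assumptions for illustration and are not part of the script above.

```python
from collections import defaultdict


def group_boxes_by_page(boxes):
    """Group box dicts by their 'page' key (0-based page index)."""
    pages = defaultdict(list)
    for box in boxes:
        pages[box["page"]].append(box)
    return dict(pages)


# Hypothetical records, shaped like boxes_data plus a "page" key.
boxes = [
    {"page": 0, "startX": 72, "startY": 700, "endX": 540, "endY": 720, "text": "title"},
    {"page": 1, "startX": 72, "startY": 650, "endX": 540, "endY": 690, "text": "body"},
    {"page": 0, "startX": 72, "startY": 600, "endX": 540, "endY": 640, "text": "intro"},
]
by_page = group_boxes_by_page(boxes)
print(sorted(by_page))                  # [0, 1]
print([b["text"] for b in by_page[0]])  # ['title', 'intro']
```

In the script, the page index could come from wrapping the page loop in `enumerate(...)`. Since `pdf2image.convert_from_path` returns one image per page in document order, the same index would then select the matching image.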