
How to extract text associated with image from pdf?


I am using PyMuPDF to extract images from a PDF. A code sample is below.

import pymupdf

doc = pymupdf.open('sample.pdf')

page = doc[0] # get the page

image_list = page.get_images()

page_index = 0
for image_index, img in enumerate(image_list):
    xref = img[0] # the XREF is the first item of each image entry
    pix = pymupdf.Pixmap(doc, xref) # create a Pixmap

    if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
        pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
    pix.save("page_%s-image_%s.png" % (page_index, image_index))

I am able to extract the image (extracted_image) from the sample PDF (sample_pdf).

Now I want to extract the text associated with Fig. 6.1, which should return only: Fig. 6.1 Insect bites. Linear pruritic papules with central crusts demonstrating the “breakfast, lunch, and dinner” sign. Courtesy Antonio Torrelo, MD.

I tried page.get_text("blocks") and page.get_text(), but I am not sure how to relate only the Fig. 6.1 text to the extracted image.


Solution

  • Text can be placed in a PDF in any order, so the best an extractor can usually do is export it in the sequence it appears on the page, top down. Images and text, however, are generally unrelated objects.

    Acrobat and many Python extractors can do something similar by reordering the page text to follow the layout. But the output gives no indication of embedded images, diagrams, or other graphics such as tables. So without metadata, all we know is that the page contains text and that it contains images.

    What you need to extract are the coordinates of the text and the images, so they can be compared mathematically. One workaround is to inject spaces between letters to reveal larger gaps, but without coordinates that is still a fragile strategy.

    Likewise, converting to HTML or reading the raw numeric values has its problems. On their own, the numbers do not seem to make sense: the Fig. 6.1 Insect bites. text appears higher (a lower value) than the image.

    The reason is that the image has a transformation applied, which means its units are smaller than the values shown.

    So to determine where text and images sit in relation to each other, you need to dig into the page-relative positioning, based on the current matrix and current transforms.

    PyMuPDF (MuPDF/Fitz) is good for those low-level calculations, but you (as the driver) need to decide which methodology to apply in different cases.

    Is the text above, below, or alongside the images? Is the text before Table 6.1 or after image 6.2?

    Unless some form of boxed boundary encompasses the two objects, they are totally unrelated, except through the reader's subjective judgment, based on proximity and human understanding.

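The proximity idea above can be sketched in code. This is only one possible heuristic, not a general solution: it assumes captions start with "Fig." or "Figure" followed by a number, and that the right caption is the caption-like text block closest to the image. The helper names (rect_gap, nearest_caption, captions_for_page) and the 50-point max_gap cutoff are made up for illustration. The PyMuPDF calls used are page.get_text("blocks"), page.get_images(full=True) and page.get_image_rects(xref); as far as I know, the latter returns the image rectangle in page coordinates with the image's transformation already applied, so it can be compared directly with the text block rectangles.

```python
import re

# Assumption: captions look like "Fig. 6.1 ..." or "Figure 3 ...".
CAPTION_RE = re.compile(r"^\s*Fig(?:ure)?\.?\s*\d+", re.IGNORECASE)


def rect_gap(a, b):
    """Shortest distance between two (x0, y0, x1, y1) rectangles; 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def nearest_caption(image_rect, text_blocks, max_gap=50.0):
    """Return the caption-like block closest to image_rect, or None.

    text_blocks uses the tuple layout of page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type).
    max_gap (in points) is an arbitrary cutoff chosen for this sketch.
    """
    best_text, best_gap = None, max_gap
    for x0, y0, x1, y1, text, *_ in text_blocks:
        if not CAPTION_RE.match(text):
            continue  # skip body text, headings, and so on
        gap = rect_gap(image_rect, (x0, y0, x1, y1))
        if gap <= best_gap:
            # Collapse line breaks inside the block into single spaces.
            best_text, best_gap = " ".join(text.split()), gap
    return best_text


def captions_for_page(pdf_path, page_number=0):
    """Map (xref, image rectangle) to the nearest caption text on one page."""
    import pymupdf  # imported here so the helpers above work without it

    doc = pymupdf.open(pdf_path)
    page = doc[page_number]
    # block_type 0 marks text blocks; 1 marks image blocks.
    text_blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    results = {}
    for img in page.get_images(full=True):
        xref = img[0]
        # One image can be placed on the page more than once.
        for rect in page.get_image_rects(xref):
            results[(xref, tuple(rect))] = nearest_caption(tuple(rect), text_blocks)
    return results
```

Calling captions_for_page('sample.pdf') would then give one entry per placed image. Whether the caption sits above, below, or beside the image does not matter to this distance measure, but pages with several figures close together will likely need a stricter rule, such as requiring horizontal overlap or preferring one side.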