Extract an image from a PDF in Python


I'm trying to extract images from a PDF using PyPDF2, but the image my code produces looks very different from what it should. This is what I get:

[image]

But this is how it should really look:

[image]

Here's the pdf I'm using:

https://www.hbp.com/resources/SAMPLE%20PDF.pdf

Here's my code:

import PyPDF2

pdf_filename = "SAMPLE.pdf"
# Legacy PyPDF2 API (PdfFileReader/getPage, pre-3.0)
with open(pdf_filename, 'rb') as pdf_file:
    cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
    page = cond_scan_reader.getPage(0)

    xObject = page['/Resources']['/XObject'].getObject()
    i = 0
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            # DCTDecode means the stream is JPEG data, so it is written out unchanged
            if xObject[obj]['/Filter'] == '/DCTDecode':
                data = xObject[obj]._data
                with open("{}.jpg".format(i), "wb") as img:
                    img.write(data)
                i += 1

And since I need to keep the image in its colour mode, I can't just convert it to RGB if it was CMYK, because I need that information. Also, I'm trying to get the DPI of the images I extract from a PDF; is that information always stored in the image? Thanks in advance.


Solution

  • I used pdfreader to extract the image from your example. The image uses an ICCBased colorspace with N=4 and an Intent of RelativeColorimetric. This means that the "closest" PDF colorspace is DeviceCMYK.

    All you need is to convert the image to RGB and invert the colors.

    Here is the code:

    from pdfreader import SimplePDFViewer
    import PIL.ImageOps

    with open("SAMPLE PDF.pdf", "rb") as fd:
        viewer = SimplePDFViewer(fd)
        viewer.render()
        img = viewer.canvas.images['Im0']

        # this displays ICCBased 4 RelativeColorimetric
        print(img.ColorSpace[0], img.ColorSpace[1].N, img.Intent)

        pil_image = img.to_Pillow()
        pil_image = pil_image.convert("RGB")
        inverted = PIL.ImageOps.invert(pil_image)
        inverted.save("sample.png")


    Read more on PDF objects: Image (sec. 8.9.5), InlineImage (sec. 8.9.7)
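
The "invert the colors" step above is plain per-sample arithmetic. As a minimal sketch, assuming 8-bit samples and using a hypothetical helper `invert_samples` (not part of pdfreader or Pillow), it can be written without any imaging library:

```python
def invert_samples(data: bytes) -> bytes:
    """Invert every 8-bit sample: s becomes 255 - s."""
    return bytes(255 - s for s in data)


# Illustrative data only: one 4-channel pixel, one byte per channel.
stored = bytes([255, 0, 255, 255])
inverted = invert_samples(stored)

# The number of channels is unchanged by the inversion.
print(tuple(inverted))  # (0, 255, 0, 0)
```

Because this never changes the number of channels, the same arithmetic could in principle be applied to 4-channel data without converting to RGB first. Whether that gives the right colours for a particular file is an assumption to check against the rendered PDF.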