
Extracting image from a PDF using PyPDF without a "/Filter" tag in the xObject


I am currently using something like this to extract images from a PDF:

import PyPDF4
from PIL import Image
from pathlib import Path
import os

PDFFilePath = Path("somefile.pdf")
OutputFolder = "somedirectory"
pdfPage = 0

with open(PDFFilePath,'rb') as pdf_reader:
    pdf_object = PyPDF4.PdfFileReader(pdf_reader)
    PageFolder = Path(OutputFolder).joinpath(Path(PDFFilePath.stem + '.'+ str(pdfPage)))
    if not PageFolder.exists():
        os.makedirs(PageFolder)

    CurrentPage = pdf_object.getPage(pdfPage)
    xObject = CurrentPage['/Resources']['/XObject'].getObject()

    for obj_index,obj in enumerate(xObject):
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                # Pillow infers the PNG format from the file extension
                img.save(PageFolder.joinpath(Path(PDFFilePath).stem +"."+ str(pdfPage) + "."+ str(obj_index) + ".png"))
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(PageFolder.joinpath(Path(PDFFilePath).stem +"."+ str(pdfPage) + "."+ str(obj_index)+ ".jpg"),'wb')
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(PageFolder.joinpath(Path(PDFFilePath).stem +"."+ str(pdfPage) + "."+ str(obj_index)+ ".jp2"),'wb')
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                img = open(PageFolder.joinpath(Path(PDFFilePath).stem +"."+ str(pdfPage) + "."+ str(obj_index)+ ".tiff"),'wb')
                img.write(data)
                img.close()

I encountered a number of PDFs whose image XObjects have no '/Filter' entry, so the xObject[obj]['/Filter'] checks above do not apply. I tried building the images from the raw bytes returned by data = xObject[obj].getData() with Pillow, but it throws an error saying it "does not have enough data". OpenCV's cv2.imdecode returns None for the same bytes.
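
The "not enough data" error is what one would expect when the chosen Pillow mode needs more bytes than the stream contains. A minimal sketch of that arithmetic, assuming the stream holds unfiltered raw samples with every row padded to a whole byte; `expected_raw_size` is a hypothetical helper, not part of PyPDF4 or Pillow, and the page size is made up:

```python
def expected_raw_size(width, height, bits_per_component, components=1):
    """Bytes needed for raw image samples when each row is padded to a whole byte."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height

# Hypothetical 1700x2200 scan
print(expected_raw_size(1700, 2200, 1))     # 468600   (1-bit bilevel)
print(expected_raw_size(1700, 2200, 8))     # 3740000  (8-bit, as mode "P" expects)
print(expected_raw_size(1700, 2200, 8, 3))  # 11220000 (8-bit RGB)
```

If `len(data)` matches the 1-bit figure, the image is probably bilevel (its `/BitsPerComponent` should then be 1), and mode `"P"` would ask for roughly eight times more data than is there.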

The PDFs given are confidential so I cannot give a sample.

A solution still using PyPDF4 would be nice.

EDIT: OpenCV image reader

The OpenCV part (I removed it from the code above; it would run when '/Filter' is not present):

import cv2
import numpy as np

cv_color_space = cv2.IMREAD_COLOR if mode == "RGB" else cv2.IMREAD_GRAYSCALE
buf = np.frombuffer(data,np.uint8)
img = cv2.imdecode(buf,cv_color_space)
cv2.imwrite("outputfile.png",img)
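
`cv2.imdecode` expects an encoded image file (PNG, JPEG, TIFF and so on) in the buffer, so `None` is the likely outcome when the stream holds bare samples with no container around them. A rough way to check, using a hypothetical `sniff_container` helper that only knows a handful of common signatures:

```python
# File signatures for a few common containers (not an exhaustive list).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
}

def sniff_container(data):
    """Return a container name if data starts with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None
```

A `None` result for the PDF stream suggests the bytes need either a header or an explicit size and bit depth before any decoder can read them.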

Solution

  • The image data apparently belongs in a .tiff file but comes without a header. I found this answer: https://stackoverflow.com/a/34555343/13919892

    I added this function to my code:

    import struct
    
    def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
        # The trailing 'l' is the 4-byte offset of the next IFD (0 = none), as TIFF requires
        tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'l'
        return struct.pack(tiff_header_struct,
                           b'II',  # Byte order indication: little-endian
                           42,  # Version number (always 42)
                           8,  # Offset to first IFD
                           8,  # Number of tags in IFD
                           256, 4, 1, width,  # ImageWidth, LONG, 1, width
                           257, 4, 1, height,  # ImageLength, LONG, 1, height
                           258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                           259, 3, 1, CCITT_group,  # Compression, SHORT, 1, value (1 = uncompressed, 3 = CCITT Group 3, 4 = CCITT Group 4)
                           262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                           273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                           278, 4, 1, height,  # RowsPerStrip, LONG, 1, height
                           279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                           0  # last IFD
                           )
    

    then added this to my code:

    if '/Filter' not in xObject[obj]:
        # Compression value 1 means "uncompressed" in TIFF, which fits a stream with no /Filter
        tiff_header = tiff_header_for_CCITT(size[0], size[1], len(data), 1)
        # The header declares WhiteIsZero, while the raw samples here appear to use 0 = black, so flip every bit
        inv_data = bytes(b ^ 0xFF for b in data)
        tiff_data = tiff_header + inv_data  # Add the header to the inverted data
        # Write the TIFF file
        with open(PageFolder.joinpath(Path(PDFFilePath).stem + "." + str(pdfPage) + "." + str(obj_index) + ".tiff"), 'wb') as img:
            img.write(tiff_data)
        continue
    

    I still need to find out how to tell whether the bits need to be inverted and which "CCITT Group" value to use.

    I'll mark this as the answer and maybe just open a new question for this.
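
On the open question, one possible reading: the `CCITT_group` argument ends up in the TIFF Compression tag (259), where 1 means uncompressed, and tag 262 in the header function above is PhotometricInterpretation, where 0 is WhiteIsZero and 1 is BlackIsZero. Both could in principle be derived from the image dictionary instead of guessed. A minimal sketch, assuming a 1-bit `/DeviceGray` image, a single `/Filter` name rather than an array, and a plain dict standing in for `xObject[obj]`; `tiff_params_for` is a hypothetical helper:

```python
def tiff_params_for(image_dict):
    """Guess (compression, photometric) TIFF tag values from a PDF image dictionary."""
    if "/Filter" not in image_dict:
        # No filter: raw samples, so TIFF Compression = 1 (uncompressed).
        # PDF gray samples default to 0 = black; /Decode [1 0] flips that.
        decode = [float(v) for v in image_dict.get("/Decode", [0, 1])]
        photometric = 0 if decode == [1.0, 0.0] else 1
        return 1, photometric
    if image_dict["/Filter"] == "/CCITTFaxDecode":
        # /K < 0 means Group 4; /K >= 0 means Group 3 (the default /K is 0).
        k = image_dict.get("/DecodeParms", {}).get("/K", 0)
        compression = 4 if k < 0 else 3
        # Photometric 0 mirrors the header function above; /BlackIs1 is not handled here.
        return compression, 0
    raise ValueError("filter not covered by this sketch")

print(tiff_params_for({}))  # (1, 1)
print(tiff_params_for({"/Filter": "/CCITTFaxDecode", "/DecodeParms": {"/K": -1}}))  # (4, 0)
```

If this reading is right, writing the returned photometric value into tag 262 (instead of the hard-coded 0) should make the bit inversion unnecessary for unfiltered streams, though I have only reasoned this through, not tested it on real files.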