
Reading a pdf from a zipfile


I'm trying to get PyPDF2 to read a small .pdf file that is within a simple zip file. Here's what I've got so far:

import PyPDF2,zipfile

with zipfile.ZipFile("TEST.zip") as z:
    filename = z.namelist()[0]
    a = z.filelist[0]
    b = z.open(filename)
    c = z.read(filename)
    PyPDF2.PdfFileReader(b)

Error Message:

PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
io.UnsupportedOperation: seek


Solution

  • z.open() does not extract the file. It returns a stream that decompresses the archive member as it is read, and, as the io.UnsupportedOperation: seek error shows, that stream does not support seek() here. PdfFileReader needs a stream it can seek in.

    That's OK, though, because z.read() returns the decompressed bytes, and BytesIO turns them into a seekable stream for PdfFileReader, as in the example below. If you left out BytesIO and passed the bytes directly, you'd get: AttributeError: 'bytes' object has no attribute 'seek'.

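    import zipfile
    from io import BytesIO

    import PyPDF2

    with zipfile.ZipFile('sample.zip', 'r') as z:
        filename = z.namelist()[0]  # first entry in the archive
        # z.read() returns the decompressed bytes; BytesIO makes them seekable
        pdf_file = PyPDF2.PdfFileReader(BytesIO(z.read(filename)))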
    

    Result:

    In [20]: pdf_file
    Out[20]: <PyPDF2.pdf.PdfFileReader at 0x7f01b61db2b0>
    
    In [21]: pdf_file.getPage(0)
    Out[21]: 
    {'/Type': '/Page',
     '/Parent': {'/Type': '/Pages',
      '/Count': 2,
      '/Kids': [IndirectObject(4, 0), IndirectObject(6, 0)]},
     '/Resources': {'/Font': {'/F1': {'/Type': '/Font',
        '/Subtype': '/Type1',
        '/Name': '/F1',
        '/BaseFont': '/Helvetica',
        '/Encoding': '/WinAnsiEncoding'}},
      '/ProcSet': ['/PDF', '/Text']},
     '/MediaBox': [0, 0, 612, 792],
     '/Contents': {}}
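
To see the BytesIO step on its own, here is a minimal sketch that uses only the standard library, so it runs without PyPDF2 or any file on disk. The helper name member_as_stream, the entry name example.pdf, and the placeholder payload are assumptions made for illustration; the payload is not a valid PDF.

```python
import io
import zipfile


def member_as_stream(archive, member_name=None):
    """Return one archive member as a seekable in-memory stream.

    Hypothetical helper for illustration; not part of zipfile or PyPDF2.
    If member_name is None, the first entry in the archive is used.
    """
    with zipfile.ZipFile(archive) as z:
        name = member_name if member_name is not None else z.namelist()[0]
        # z.read() decompresses the whole member and returns plain bytes;
        # BytesIO wraps those bytes in a file-like object with seek()/tell().
        return io.BytesIO(z.read(name))


# Build a tiny zip in memory so the sketch needs no files on disk.
# The payload is placeholder bytes, not a real PDF.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("example.pdf", b"%PDF-1.4 placeholder")

stream = member_as_stream(archive)
print(stream.seekable())   # True
stream.seek(0, io.SEEK_END)
print(stream.tell())       # 20
stream.seek(0)
print(stream.read(5))      # b'%PDF-'
```

With a real archive, the stream returned by such a helper could be passed to PyPDF2.PdfFileReader exactly as in the answer's example. Because the bytes live in memory, the stream remains usable after the zip file has been closed.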