
pyPdf for IndirectObject extraction


Following this example, I can list all the objects in a PDF file:

import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
list(pdf.pages) # Process all the objects.
print pdf.resolvedObjects

Now I need to extract a non-standard object from the PDF file.

My object is named MYOBJECT, and it is a string.

The part of the script's output that concerns me is:

{'/MYOBJECT': IndirectObject(584, 0)}

The relevant part of the PDF file is:

558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
  <</ColorSpace <</CS0 563 0 R>>
    /ExtGState <</GS0 568 0 R>>
    /Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
    /ProcSet[/PDF/Text/ImageC]
    /Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
    /XObject<</Im0 578 0 R>>>>
  /Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream

1_22_4_1     --->>>>  this is the string I need to extract from the object

endstream
endobj

How can I follow the 584 reference to get at my string (using pyPdf, of course)?


Solution

  • Each element in pdf.pages is a dictionary, so assuming the object is on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

    You can try printing it on its own, or explore it with help() and dir() at a Python prompt to find out how to get the string out of it.

    Edit:

    After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT']; its value can be retrieved with getData().

    The following function is a more generic solution: it searches recursively for the key in question.

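    import pyPdf

    # Open in binary mode; a PDF is not a text file.
    pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))
    pages = list(pdf.pages)  # Process all the pages so pdf.resolvedObjects is populated.

    def findInDict(needle, haystack):
        """Return the first value stored under needle in nested dictionaries, or None."""
        for key in haystack.keys():
            try:
                value = haystack[key]  # Looking up a value can fail, so skip keys that raise.
            except Exception:
                continue
            if key == needle:
                return value
            # Descend into plain dicts and pyPdf dictionary objects alike.
            if isinstance(value, (dict, pyPdf.generic.DictionaryObject)):
                x = findInDict(needle, value)
                if x is not None:
                    return x
        return None

    answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()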
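
One thing to watch with a recursive search like this: PDF object graphs can contain back-references (a page's /Parent entry, for instance), so a search that follows every dictionary may revisit the same object repeatedly. Below is a minimal sketch of the same idea with a visited set added. It uses plain Python dicts and lists as stand-ins for the PDF objects; the name find_key and the sample data are hypothetical, and it has not been run against real pyPdf objects, where values may need to be resolved by indexing as in the function above.

```python
def find_key(needle, haystack, _seen=None):
    """Depth-first search for needle in nested dicts and lists.

    Containers already visited are remembered by id(), so reference
    cycles (such as a /Parent entry pointing back up the tree) terminate.
    """
    if _seen is None:
        _seen = set()
    if id(haystack) in _seen:
        return None
    _seen.add(id(haystack))

    if isinstance(haystack, dict):
        if needle in haystack:
            return haystack[needle]
        children = list(haystack.values())
    elif isinstance(haystack, (list, tuple)):
        children = list(haystack)
    else:
        return None  # Leaf value (string, number, ...): nothing to descend into.

    for child in children:
        found = find_key(needle, child, _seen)
        if found is not None:
            return found
    return None


# Hypothetical stand-in data, shaped like the page dictionary in the question.
page = {
    '/Type': '/Page',
    '/Resources': {
        '/Properties': {
            '/MC0': {'/MYOBJECT': '1_22_4_1'},
        },
    },
}
pages_root = {'/Kids': [page]}
page['/Parent'] = pages_root  # Back-reference: this creates a cycle.

result = find_key('/MYOBJECT', {0: {558: page}})
missing = find_key('/NOPE', {0: {558: page}})
```

Here result holds the stand-in string and missing is None; the second call returns instead of recursing forever because the page dictionary is only visited once.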