
Working with streams in PDFrw for Python?


I'm trying to read an example PDF with pdfrw. The PDF contains the phrase "Hello Matthew" near the bottom-left corner, at coordinates (100, 100). When I try to output the text (if that is even possible), I get a stream of encoded data, and I can't figure out how to turn it into readable text.

>>> import pdfrw

>>> file_object = pdfrw.PdfReader("Hello.pdf")
>>> file_object
{'/ID': ['<f643bc0910dfb67725d53e11054f4609>', '<f643bc0910dfb67725d53e11054f4609>'], '/Info': (5, 0), '/Root': {'/Outl
ines': (8, 0), '/PageMode': '/UseNone', '/Pages': {'/Count': '1', '/Kids': [{'/Contents': (7, 0), '/MediaBox': ['0', '0
', '595.2756', '841.8898'], '/Parent': {...}, '/Resources': {'/Font': (1, 0), '/ProcSet': ['/PDF', '/Text', '/ImageB',
'/ImageC', '/ImageI']}, '/Rotate': '0', '/Trans': {}, '/Type': '/Page'}], '/Type': '/Pages'}, '/Type': '/Catalog'}, '/S
ize': '9'}

>>> file_object.pages[0]
{'/Contents': (7, 0), '/MediaBox': ['0', '0', '595.2756', '841.8898'], '/Parent': {'/Count': '1', '/Kids': [{...}], '/T
ype': '/Pages'}, '/Resources': {'/Font': (1, 0), '/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI']}, '/Rot
ate': '0', '/Trans': {}, '/Type': '/Page'}

>>> file_object.pages[0].keys()
['/Contents', '/MediaBox', '/Parent', '/Resources', '/Rotate', '/Trans', '/Type']

>>> file_object.pages[0].Contents
{'/Filter': ['/ASCII85Decode', '/FlateDecode'], '/Length': '102'}

>>> file_object.pages[0].Contents.stream
'GapQh0E=F,0U\\H3T\\pNYT^QKk?tc>IP,;W#U1^23ihPEM_?CW4KISi90EC-p>QkRte=<%V"lI7]P)Rn29neZ[Kb,htEWn&q7Q2"V~>'

Solution

  • That stream is compressed. You can tell from the /Filter entry in the stream's dictionary: the data was Flate-compressed and then ASCII85-encoded.

    Unfortunately, pdfrw does not (yet?) know how to decompress streams that use those filters. If you first run your PDF through something like pdftk to decompress it, you might see something more readable.

    Disclaimer: I am the primary pdfrw author.

    But...

    Even then, especially for non-ASCII fonts, the character-to-glyph mapping in PDFs is complicated, so you won't always see something that looks reasonable.

    If you really want to examine the text in PDF files in depth, pdfminer might be more useful; pdfrw has not yet grown good tools for that.
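
For a stream whose /Filter is exactly ['/ASCII85Decode', '/FlateDecode'], as in the question, the two decoding steps can also be sketched with the Python standard library alone. This is only an illustration, not part of pdfrw: the helper name decode_ascii85_flate is made up here, and the sample content stream is invented for the round-trip check rather than taken from Hello.pdf. It assumes the stream text is what pdfrw exposes as Contents.stream, ending in the "~>" ASCII85 end marker.

```python
import base64
import zlib


def decode_ascii85_flate(stream):
    """Undo /ASCII85Decode and then /FlateDecode on a PDF stream.

    Hypothetical helper for illustration; not a pdfrw function.
    """
    # Accept either str (as shown in the question) or bytes.
    raw = stream.encode("latin-1") if isinstance(stream, str) else stream
    raw = raw.strip()
    # PDF terminates ASCII85 data with "~>"; drop it before decoding.
    if raw.endswith(b"~>"):
        raw = raw[:-2]
    compressed = base64.a85decode(raw)  # first filter: ASCII85
    return zlib.decompress(compressed)  # second filter: Flate (zlib)


# Round-trip check on an invented content stream.
sample = b"BT /F1 12 Tf 100 100 Td (Hello Matthew) Tj ET"
encoded = base64.a85encode(zlib.compress(sample)).decode("ascii") + "~>"
print(decode_ascii85_flate(encoded))
```

With a real file, the argument would be something like file_object.pages[0].Contents.stream. The result is the page's raw content stream (drawing operators and string operands), so the caveat above about character-to-glyph mapping still applies.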