Tags: python, pdf, data-mining, pymupdf

Print all objects inside a PDF file with Python


I'd like to list all objects present in a PDF file: text blocks, images, fonts, page objects, but also vector shapes (if any).

I hoped to see all of them with PyMuPDF:

import fitz  # pip install PyMuPDF
doc = fitz.open('test.pdf')
for xref in range(1, doc.xref_length()):
    print(doc.xref_object(xref))

but not everything is there. For example, text is not there. Text can be obtained separately with:

print(doc.load_page(0).get_text('dict'))

but I'm looking for a general method rather than one method for text elements, another for other objects, and so on.

Question: how to print all objects present in a PDF file? (text blocks, images, vector shapes, etc.)

Notes:

  • I've already read How to extract text from a PDF file? and similar questions, but they are specific to text, whereas I'm looking for all objects / attributes.

  • I've already read How to open PDF raw?, but it did not help here.

  • When opening a PDF with a text editor, we see a lot of human-unreadable binary data (it seems that it is not only for images).

TL;DR: I'm looking for a representation like:

Object0
    TYPE:TEXT
    CONTENT:lorem ipsum
    POSITION:123,123

Object1
    TYPE:IMAGE
    ...

Object2
    TYPE:...
    ...

Solution

  • Bear with me, please.

    This isn't really an answer; it is a long comment on the overloaded use of the term "object", not only by the OP and commenters but also by the PDF spec itself.

    PDF is really just JSON on steroids

    PDF has first-class support for booleans, integers, real numbers, strings, names, arrays, dictionaries, streams, and a singleton null object. But instead of describing the document as one giant dictionary, PDF lets you define an object with an object-id and reference it later by that object-id. These are called indirect objects. The PDF document is actually just a bag of objects, with an index (the cross-reference table) and a pointer to the "root" object at the tail of the file.
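
To make the "bag of objects" idea concrete, here is a toy model in plain Python. It is only a sketch: the `Ref` class, the `objects` dictionary and the `resolve` helper are hypothetical names invented for this illustration, not part of PyMuPDF or any PDF library, and the four objects are a hand-written miniature rather than a parsed file. It shows how following indirect references from the root turns the flat bag back into a JSON-like tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as `4 0 R`."""
    obj_id: int


# A hand-written miniature "bag of objects", keyed by object-id.
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
    3: {"Type": "Page", "Parent": Ref(2), "Contents": Ref(4)},
    4: {"Length": 55},  # dictionary part of a stream object
}
trailer = {"Root": Ref(1)}  # the pointer to the "root" object


def resolve(value, objects, seen=frozenset()):
    """Replace references by the objects they point to, recursively."""
    if isinstance(value, Ref):
        if value.obj_id in seen:
            return value  # back-reference (e.g. Parent): keep it as a Ref
        return resolve(objects[value.obj_id], objects, seen | {value.obj_id})
    if isinstance(value, dict):
        return {k: resolve(v, objects, seen) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, objects, seen) for v in value]
    return value


tree = resolve(trailer["Root"], objects)
print(tree["Pages"]["Kids"][0]["Contents"])
```

Back-references such as Parent are left unresolved on purpose; without that guard the walk would never terminate.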

    INDIRECT OBJECTS

    The objects in a PDF that have an object-id are what is typically meant by the informal use of the term "objects in a PDF". They describe the structure of the document and all the resources needed to produce it. However, apart from streams, these objects hold none of the actual content.

    [Figure: The "objects" of a PDF document]

    STREAMS hold the content

    Streams hold a small postfix-based command language that is interpreted by the PDF viewer. Here is an example from https://brendanzagaeski.appspot.com/0004.html: a valid snippet of PDF showing an indirect object with object-id 4 and of type stream. My comments are on the right.

    4 0 obj                 begin indirect object 4
      << /Length 55 >>      { 'Length': 55}
    stream                  begin stream type
      BT                        begin-text-object command
        /F1 18 Tf               change-font to font with descriptor F1 at size 18pt
        0 0 Td                  position-text at x=0, y=0
        (Hello World) Tj        render-text "Hello World"
      ET                        end-text-object command
    endstream               end stream type
    endobj                  end object
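
In real files the stream body is usually compressed, most commonly with /Filter /FlateDecode, which is one reason a PDF looks like binary noise in a text editor. The sketch below uses only the standard library to build such an object by hand and decode it again. The `decode_stream` helper is a hypothetical name, and it assumes a single FlateDecode filter and a literal /Length value (in real files /Length may itself be an indirect reference, and other filters exist).

```python
import re
import zlib

commands = b"BT /F1 18 Tf 0 0 Td (Hello World) Tj ET"
compressed = zlib.compress(commands)

# A hand-built indirect object, laid out the way it would sit in a file.
raw_object = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)


def decode_stream(raw):
    """Return the decoded stream body of one raw indirect object."""
    length = int(re.search(rb"/Length\s+(\d+)", raw).group(1))
    start = raw.index(b"stream\n") + len(b"stream\n")
    data = raw[start:start + length]
    # Only the dictionary part (before the stream keyword) names the filter.
    if b"/FlateDecode" in raw[:start]:
        data = zlib.decompress(data)
    return data


print(decode_stream(raw_object).decode("ascii"))
```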
    

    GRAPHICS OBJECTS - the twist in the knickers

    The PDF spec refers to all of the elements instantiated by commands inside a stream as "graphics objects". Yes, even text objects are graphics objects. However, these objects aren't declared with properties; they are defined by instructions on how to build them, governed by an overarching state machine.

    [Figure: Graphics objects state diagram]

    THE PAIN

    So here is the twist: if you want all the graphics objects in the following form:

    { 'content': [
        { 'type': 'text', 'position': [0,0], 'text': "Hello World" }
    ]}
    

    you have to build an interpreter that keeps track of the graphics state and stores the objects away as the commands that create them are executed. A basic PDF viewer doesn't have to do this, because its interpreter maps closely onto the graphics API and the graphics state held by the graphics layer.
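
As a rough illustration of what such an interpreter involves, here is a deliberately tiny one in plain Python. It is a sketch under strong assumptions: it understands only the five operators from the Hello World stream above (BT, Tf, Td, Tj, ET), treats Td as a simple offset, and ignores the transformation matrices, string escapes, and every other operator a real content stream can contain. The names `TOKEN` and `interpret` are made up for this example.

```python
import re

# Literal strings, names, numbers, then operators (in that order).
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"   # (literal string)
    r"|/[^\s/()\[\]<>]+"      # /Name
    r"|[-+]?\d*\.?\d+"        # number
    r"|[A-Za-z'\"*]+"         # operator
)


def interpret(stream):
    """Run the commands and collect the text objects they create."""
    operands, found = [], []
    font, size = None, None   # text state set by Tf
    pos = [0.0, 0.0]          # simplified text position
    in_text = False
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            operands.append(tok[1:-1])        # escapes are not decoded
        elif tok.startswith("/"):
            operands.append(tok[1:])
        elif tok[0] in "+-.0123456789":
            operands.append(float(tok))
        else:                                  # an operator: execute it
            if tok == "BT":
                in_text, pos = True, [0.0, 0.0]
            elif tok == "ET":
                in_text = False
            elif tok == "Tf":
                font, size = operands[-2], operands[-1]
            elif tok == "Td":
                pos = [pos[0] + operands[-2], pos[1] + operands[-1]]
            elif tok == "Tj" and in_text:
                found.append({"type": "text", "position": list(pos),
                              "text": operands[-1],
                              "font": font, "size": size})
            operands.clear()                   # postfix: operands are consumed
    return {"content": found}


result = interpret("BT /F1 18 Tf 0 0 Td (Hello World) Tj ET")
print(result)
```

Even this toy has to carry state (current font, size, position) between commands, which is exactly why the objects cannot simply be read off the file.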

    So when you say objects...

    Do you mean:

    • Indirect objects
    • The document catalog in JSON format
    • All the graphics objects
    • All of the above

    References

    All images are taken from the PDF specification:

    https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf