I'd like to list all objects present in a PDF file: text blocks, images, fonts, page objects, but also vector shapes (if any).
I hoped to see all of them with PyMuPDF:
import fitz # pip install PyMuPDF
doc = fitz.open('test.pdf')
for xref in range(1, doc.xref_length()):
    print(doc.xref_object(xref))
but not everything is there. For example, text is not there. Text can be obtained separately with:
print(doc.load_page(0).get_text('dict'))
but I'm looking for a general method, rather than one method for text elements, another for other objects, and so on.
Question: how to print all objects present in a PDF file? (text blocks, images, vector shapes, etc.)
Notes:
I've already read "How to extract text from a PDF file?" and similar questions, but they are specific to text, whereas I'm looking for all objects / attributes.
I've also read "How to open PDF raw?", but it did not help here.
When opening a PDF with a text editor, we see a lot of human-unreadable binary data (it seems that it is not only for images).
TL;DR: I'm looking for a representation like:
Object0
TYPE:TEXT
CONTENT:lorem ipsum
POSITION:123,123
Object1
TYPE:IMAGE
...
Object2
TYPE:...
...
Bear with me, please.
This isn't an answer but is really a complex comment in response to the overloaded use of the term "object" not only by the OP and commenters, but also by the PDF spec itself.
PDF has first-class support for booleans, integers, real numbers, strings, names, arrays, dictionaries, streams, and a singleton null object. But instead of describing the document as one giant dictionary, PDF allows defining objects with an object-id and referencing them later by that object-id. These are called indirect objects. The PDF document is actually just a bag of objects, with an index and a pointer to the "root" object at the tail of the file.
These objects that have an object-id are what is typically meant by the informal use of the term "objects in a PDF". They are used to describe the structure of the document and all the resources that are needed to produce it. However, most of these objects hold none of the actual page content.
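To make the "bag of objects" idea concrete, here is a minimal sketch that scans raw bytes for indirect objects and follows the trailer's pointer to the root. The `RAW` fragment and the helper names `scan_indirect_objects` and `references` are made up for illustration. The regular expressions are deliberately naive: they assume an uncompressed, classic-style file and ignore the xref index, object streams, and strings that happen to contain `obj` or `R`.

```python
import re

# Hypothetical hand-written fragment, not a complete valid PDF file.
RAW = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
trailer
<< /Root 1 0 R /Size 4 >>
"""

# "<number> <generation> obj ... endobj" delimits one indirect object.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "<number> <generation> R" is a reference to an indirect object.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def scan_indirect_objects(raw):
    """Map (object number, generation) to the raw body of each indirect object."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJ_RE.finditer(raw)
    }


def references(body):
    """List the (object number, generation) pairs referenced inside a body."""
    return [(int(num), int(gen)) for num, gen in REF_RE.findall(body)]


objects = scan_indirect_objects(RAW)

# The trailer at the tail of the file points at the root (catalog) object.
root = references(RAW[RAW.rfind(b"trailer"):])[0]

for key, body in sorted(objects.items()):
    print(key, body.decode("ascii"), "-> refers to", references(body))
print("root object:", root)
```

Note that object 3 refers to an object 4 that this fragment never defines; a real reader would look it up through the index rather than by scanning.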
Streams are used to hold a small postfix-based command language that is interpreted by the PDF viewer. Here is an example from https://brendanzagaeski.appspot.com/0004.html showing an actual valid snippet of PDF: an indirect object with object-id 4 and of type stream. My comments are on the right.
4 0 obj                  begin indirect object 4
<< /Length 55 >>         { 'Length': 55 }
stream                   begin stream type
BT                       begin-text-object command
/F1 18 Tf                change-font to font with descriptor F1 at size 18pt
0 0 Td                   position-text at x=0, y=0
(Hello World) Tj         render-text "Hello World"
ET                       end-text-object command
endstream                end stream type
endobj                   end object
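Since the command language is postfix, the operands come first and the operator that consumes them comes last. A rough sketch of splitting the stream body above into (operator, operands) pairs could look like this. The names `TOKEN_RE` and `parse_content_stream` are made up for illustration, and the tokenizer only covers what this snippet uses: it assumes an already-decompressed stream and does not handle hex strings, arrays, dictionaries, nested parentheses, or inline images.

```python
import re

# Token kinds, tried in this order at each position.
TOKEN_RE = re.compile(r"""
    \((?:\\.|[^\\()])*\)    # literal string without nested parentheses
  | /[^\s/\[\]()<>]+        # name operand such as /F1
  | [+-]?\d*\.?\d+          # integer or real number
  | [A-Za-z*']+             # operator such as BT, Tf, Tj
""", re.VERBOSE)


def parse_content_stream(text):
    """Group each operator with the operands that precede it (postfix order)."""
    commands, operands = [], []
    for token in TOKEN_RE.findall(text):
        first = token[0]
        if first == "(":
            operands.append(token[1:-1])   # string operand, escapes left as-is
        elif first == "/":
            operands.append(token)         # name operand, kept with its slash
        elif first.isdigit() or first in "+-.":
            operands.append(float(token) if "." in token else int(token))
        else:
            commands.append((token, operands))  # operator consumes pending operands
            operands = []
    return commands


STREAM = """BT
/F1 18 Tf
0 0 Td
(Hello World) Tj
ET"""

commands = parse_content_stream(STREAM)
for operator, args in commands:
    print(operator, args)
```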
The PDF spec refers to all of the elements instantiated by commands inside of a stream as "graphics objects". Yes, even text objects are graphics objects. However, these objects aren't declared with properties; they are defined by instructions on how to build them, governed by an overarching state machine (see the graphics objects state diagram in the PDF specification).
So here is the twist: if you want all the graphics objects in a form like the following:
{ 'content': [
{ 'type': 'text', 'position': [0,0], 'text': "Hello World" }
]}
you have to build an interpreter that keeps track of the graphics state and stores away the objects as they get created when the commands are executed. A basic PDF viewer doesn't have to do this, because the interpreter maps closely to the graphics API and the graphics state is held by the graphics layer.
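As a rough illustration of such an interpreter, the sketch below walks a list of already-parsed (operator, operands) pairs, keeps a tiny text state, and records one dictionary per shown string. The function name `interpret` and the shape of the state are assumptions made for this example. It understands only `BT`, `ET`, `Tf`, `Td`, and `Tj`, and it ignores the transformation matrices and the way the text position advances after a string is shown, so the positions are only right for trivial cases like this one.

```python
def interpret(commands):
    """Execute (operator, operands) pairs and collect the text objects they create."""
    state = {"font": None, "size": None, "position": [0, 0]}
    content = []
    in_text = False
    for operator, args in commands:
        if operator == "BT":
            in_text = True
            state["position"] = [0, 0]         # a new text object starts at the origin
        elif operator == "ET":
            in_text = False
        elif operator == "Tf":
            state["font"], state["size"] = args  # font name and size persist in the state
        elif operator == "Td" and in_text:
            tx, ty = args                      # offset relative to the current position
            x, y = state["position"]
            state["position"] = [x + tx, y + ty]
        elif operator == "Tj" and in_text:
            # Snapshot the state at the moment the text is shown.
            content.append({
                "type": "text",
                "position": list(state["position"]),
                "font": state["font"],
                "size": state["size"],
                "text": args[0],
            })
    return {"content": content}


# The "Hello World" stream from above, already split into commands.
commands = [
    ("BT", []),
    ("Tf", ["/F1", 18]),
    ("Td", [0, 0]),
    ("Tj", ["Hello World"]),
    ("ET", []),
]

result = interpret(commands)
print(result)
```

Images and vector shapes would need the same treatment with their own operators and more of the graphics state, which is why a general "list everything" dump amounts to writing most of a renderer.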
Do you mean:
All images are taken from the PDF specification:
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf