Tags: pdf, pdflatex, qpdf

Document trailer ID in PDF: why does it consist of two strings, and how can it be extracted with a command-line tool?


Although there are tools such as pdfinfo for accessing PDF metadata, I did not find a proper way to get the trailer ID. Instead, I open the file in an editor and search... So my first question is whether there is a command-line tool that does this work for me.

There is something else I noticed and wonder about: the ID has two parts, and the trailer looks like this:

trailer << /Info 2 0 R /Root 1 0 R /Size 3656 
/ID [<2442556d3492442c8e034f4bf45c46d4><31415926535897932384626433832795>] >>

I wonder about the purpose of the two-part ID; the PDF spec does not say anything about it. In my LaTeX-created PDFs, the two parts coincide.

I also wonder why some tools, like qpdf, write lowercase letters whereas others, like LaTeX compilers, seem to use uppercase. This makes equality tests difficult.

Even when qpdf is invoked with SOURCE_DATE_EPOCH=hex number, the result does not change. ... This is unlike LaTeX compilers.


Solution

  • Concerning your second question

    I wonder about the purpose of the two-part ID; the PDF spec does not say anything about it. In my LaTeX-created PDFs, the two parts coincide.

    The PDF specification explains:

    14.4 File identifiers

    PDF file identifiers shall be defined by the ID entry in a PDF file’s trailer dictionary (see 7.5.5, "File trailer"). The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the PDF file at the time it was originally created and shall not change when the PDF file is updated. The second byte string shall be a changing identifier based on the PDF file’s contents at the time it was last updated (see 7.5.6, "Incremental updates"). When a PDF file is first written, both identifiers shall be set to the same value. If the first identifier in the reference matches the first identifier in the referenced file’s ID entry, and the last identifier in the reference matches the last identifier in the referenced file’s ID entry, it is very likely that the correct and unchanged PDF file has been found. If only the first identifier matches, a different version of the correct PDF file has been found.

    (ISO 32000-2)

    Thus, the first part identifies the document across revisions and the second part identifies the individual revision.
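
    The matching rule quoted above can be sketched in a few lines of Python. This is only an illustration: the function name compare_ids and its return strings are made up here, and the IDs are assumed to be already decoded into bytes. Note that the specification only says a full match makes it "very likely" that the same unchanged file has been found.

```python
def compare_ids(reference, candidate):
    """Classify two trailer IDs following the rule in ISO 32000-2, 14.4.

    Both arguments are (permanent_id, changing_id) tuples of bytes.
    The function name and the returned labels are illustrative only.
    """
    if reference[0] != candidate[0]:
        # The permanent identifiers differ: not the referenced document.
        return "different document"
    if reference[1] != candidate[1]:
        # Same permanent identifier, different changing identifier.
        return "same document, different revision"
    # Both identifiers match: very likely the same, unchanged file.
    return "same document, same revision"


original = (b"\x24\x42\x55\x6d", b"\x24\x42\x55\x6d")  # freshly written file
updated = (b"\x24\x42\x55\x6d", b"\x31\x41\x59\x26")   # after an update
print(compare_ids(original, updated))
```

    A freshly written file has two identical parts, which is why the two parts coincide in PDFs that come straight out of a LaTeX compiler.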

    I also wonder why some tools, like qpdf, write lowercase letters whereas others, like LaTeX compilers, seem to use uppercase. This makes equality tests difficult.

    Hex strings may use lowercase or uppercase letters, or even a mix of both. A comparison of IDs must therefore not compare the hex text literally; it should ignore case or, better, compare the bytes the strings decode to.
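
    As for the first question, a minimal sketch of how the ID could be pulled out of a file and normalized for comparison is shown below. It assumes the /ID array is stored as plain text and that both entries are hex strings, as in the trailer from the question; it will not find IDs written as literal strings in parentheses. The names ID_PATTERN, decode_hex_string and extract_trailer_id are invented for this example.

```python
import re

# Matches "/ID [<hex><hex>]"; whitespace is allowed between and inside the strings.
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f\s]*)>\s*<([0-9A-Fa-f\s]*)>\s*\]"
)


def decode_hex_string(raw: bytes) -> bytes:
    """Decode the body of a PDF hex string into the bytes it denotes."""
    digits = b"".join(raw.split())  # whitespace inside <...> is ignored
    if len(digits) % 2:
        digits += b"0"  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits.decode("ascii"))  # accepts either letter case


def extract_trailer_id(data: bytes):
    """Return (permanent_id, changing_id) from the last /ID array, or None."""
    matches = ID_PATTERN.findall(data)
    if not matches:
        return None
    # After incremental updates the most recent trailer comes last in the file.
    first, second = matches[-1]
    return decode_hex_string(first), decode_hex_string(second)


# Trailer taken from the question.
sample = (
    b"trailer << /Info 2 0 R /Root 1 0 R /Size 3656 \n"
    b"/ID [<2442556d3492442c8e034f4bf45c46d4>"
    b"<31415926535897932384626433832795>] >>"
)
permanent_id, changing_id = extract_trailer_id(sample)
print(permanent_id.hex(), changing_id.hex())
# prints: 2442556d3492442c8e034f4bf45c46d4 31415926535897932384626433832795
```

    For a real file, the bytes could be read with open(path, "rb").read() and passed to extract_trailer_id. Because the comparison then happens on decoded bytes, it no longer matters whether the writing tool used uppercase or lowercase hex digits. On the command line, qpdf --show-object=trailer file.pdf should print the trailer dictionary, including /ID, without the need to open an editor.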