pdf itext last-modified

Why does the PDF /ID field not match the last created and last modified dates?


In the PDF structure, the trailer typically contains information regarding the date when a document was created and the date when it was last modified. If these two match, we will know that this document has been untouched. However, I have also encountered examples of PDFs where the last modified and created dates match, but the /ID field (containing two hashes of the document) suggests otherwise.

Since the ID field is [<hash of the document when created>, <hash of the document when modified>], shouldn't the two IDs also match when the dates are the same?


Solution

  • Concerning your question:

    information regarding the date when a document was created and the date when it was last modified. If these two match, we will know that this document has been untouched. [...]

    shouldn't the two IDs also match when the dates are the same?

    No. Sometimes PDFs are processed right after being generated, so if the processing step is quick, the creation time may be identical to the modification time. Also, clocks on different computers may be somewhat off, so if creation and modification take place on different computers, the recorded local creation and modification times can be identical even if the two steps did not happen in the same second. The modification time can even be before the creation time!

    Furthermore, the dates are optional. If the original PDF was created without a creation date, the processing step may add both time entries with the same value.
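
    To illustrate the first point, here is a minimal sketch in plain Java (no iText involved). It assumes the dates are written in the usual PDF date string form with one-second resolution; the class name `PdfDateSketch`, the helper `toPdfDate`, and the timestamps are made up for this example.

    ```java
    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    class PdfDateSketch {
        // PDF date strings look like D:YYYYMMDDHHmmSS plus a time zone part.
        // Assumption: this sketch always writes UTC, marked with a literal "Z".
        private static final DateTimeFormatter PDF_DATE =
                DateTimeFormatter.ofPattern("'D:'yyyyMMddHHmmss'Z'").withZone(ZoneOffset.UTC);

        // Formats an instant as a PDF date string, dropping fractions of a second.
        static String toPdfDate(Instant instant) {
            return PDF_DATE.format(instant);
        }

        public static void main(String[] args) {
            // Hypothetical timestamps: creation, then a quick processing step 700 ms later.
            Instant created = Instant.parse("2020-01-01T12:00:00.100Z");
            Instant modified = created.plusMillis(700);

            System.out.println(toPdfDate(created));   // D:20200101120000Z
            System.out.println(toPdfDate(modified));  // D:20200101120000Z
            System.out.println(toPdfDate(created).equals(toPdfDate(modified))); // true
        }
    }
    ```

    Both steps fall into the same second, so the two date strings are identical even though the file was changed in between.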

    Some asides:

    In the PDF structure, the trailer typically contains information regarding the date when a document was created and the date when it was last modified.

    Strictly speaking, the dates are in the document information dictionary, which is specified to be referenced from the trailer via an indirect reference, so it is not even contained in the trailer as a direct object.
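
    As a rough illustration, the following Java sketch inspects a hypothetical classic trailer. The trailer text, the hex values, and all names (`TrailerSketch`, `infoReference`, `idPair`, `hasDateKeys`) are invented for this example, and the regular expressions only cover this simple layout; a real PDF may use a cross-reference stream or different spacing, so use a proper PDF library for real files.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class TrailerSketch {
        // Hypothetical trailer as it might appear near the end of a PDF file.
        static final String TRAILER =
                "trailer\n"
              + "<< /Size 22 /Root 1 0 R /Info 2 0 R\n"
              + "   /ID [<0123456789abcdef0123456789abcdef> <fedcba9876543210fedcba9876543210>] >>\n";

        // Matches "/Info <object number> <generation number> R".
        static final Pattern INFO_REF = Pattern.compile("/Info\\s+(\\d+)\\s+(\\d+)\\s+R");

        // Matches "/ID [<hex> <hex>]".
        static final Pattern ID_PAIR =
                Pattern.compile("/ID\\s*\\[\\s*<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>\\s*\\]");

        // Returns the indirect reference to the information dictionary, or null if absent.
        static String infoReference(String trailer) {
            Matcher m = INFO_REF.matcher(trailer);
            return m.find() ? m.group(1) + " " + m.group(2) + " R" : null;
        }

        // Returns the two identifier strings from the /ID array, or null if absent.
        static String[] idPair(String trailer) {
            Matcher m = ID_PAIR.matcher(trailer);
            return m.find() ? new String[] { m.group(1), m.group(2) } : null;
        }

        // The date keys live in the referenced dictionary, not in the trailer itself.
        static boolean hasDateKeys(String trailer) {
            return trailer.contains("/CreationDate") || trailer.contains("/ModDate");
        }

        public static void main(String[] args) {
            System.out.println(infoReference(TRAILER)); // 2 0 R
            System.out.println(hasDateKeys(TRAILER));   // false
        }
    }
    ```

    To get at the dates, a reader has to resolve the reference (here object 2, generation 0) and look for /CreationDate and /ModDate in that object.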

    Since the ID field is [<hash of the document when created>, <hash of the document when modified>]

    The ID parts are not specified to be hashes of the document at certain moments. Actually, the specification mentions an example approach for creating the identifiers that doesn't take the document contents into account at all.
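
    As far as I recall, the approach suggested in the file identifiers section of ISO 32000-1 is a message digest such as MD5 over the current time, the file location, the file size, and the values in the information dictionary. The Java sketch below follows that idea; the class and method names, the way the inputs are concatenated, and the sample values are my own assumptions, not something the specification prescribes.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.List;

    class FileIdSketch {
        // Digests time, location, size and info dictionary values.
        // Note that the page content of the document is never an input.
        static String fileId(long timeMillis, String path, long sizeBytes, List<String> infoValues) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                md5.update(Long.toString(timeMillis).getBytes(StandardCharsets.UTF_8));
                md5.update(path.getBytes(StandardCharsets.UTF_8));
                md5.update(Long.toString(sizeBytes).getBytes(StandardCharsets.UTF_8));
                for (String value : infoValues) {
                    md5.update(value.getBytes(StandardCharsets.UTF_8));
                }
                return hex(md5.digest());
            } catch (NoSuchAlgorithmException e) {
                // MD5 is required to be available on every Java platform.
                throw new IllegalStateException(e);
            }
        }

        // Renders bytes as lowercase hex, the way ID strings are commonly written.
        static String hex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Hypothetical inputs for a file that is created and later processed.
            List<String> info = Arrays.asList("My Title", "Some Producer");
            String first = fileId(1577880000100L, "/tmp/example.pdf", 12345L, info);
            String second = fileId(1577880000800L, "/tmp/example.pdf", 12399L, info);
            System.out.println("/ID [<" + first + "> <" + second + ">]");
        }
    }
    ```

    With such a scheme, the second identifier changes whenever it is recomputed at a different time, regardless of what the dates in the information dictionary say.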

    I have also encountered examples of PDFs where [...]

    In PDFs in the wild, you often find deviations from the specification, in particular in information that is optional and purely metadata.