Tags: pdf, pdf-generation, postscript

Is there a way to discard previous pdfmark metadata?


I was trying to automate adding title, bookmarks and such to some PDFs I need. The way I came up with was to create a simple pdfmark script like this:

% pdfmark.ps
[ /Title (My document)
  /Author (Me)
  /DOCINFO pdfmark

[ /Title (First chapter)
  /Page 1
  /OUT pdfmark

Then generate a new PDF with Ghostscript using:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=out.pdf in.pdf pdfmark.ps

If in.pdf doesn't have any pdfmark data, this works fine. If it does, things go wrong: for example, the title and author aren't modified, and bookmarks are appended instead of replaced.

Since I don't want to mess around with modifying the PDF's corresponding PostScript, I'm looking for a command to add to pdfmark.ps that deletes (or overwrites) the previous metadata.


Solution

  • I'll leave PostScript to others and show how to remove a PDF outline using the qpdf package (for qpdf and fix-qdf) and GNU sed.

    From the qpdf manual:

    In QDF mode, qpdf creates PDF files in what we call QDF form. A PDF file in QDF form, sometimes called a QDF file, is a completely valid PDF file that has %QDF-1.0 as its third line (after the pdf header and binary characters) and has certain other characteristics. The purpose of QDF form is to make it possible to edit PDF files, with some restrictions, in an ordinary text editor.

    (On a non-GNU/Linux system, adapt the commands below.)

    qpdf --qdf --compress-streams=n --decode-level=generalized \
         --object-streams=disable -- in.pdf - |
    sed --binary \
        -e '/^[ ][ ]*\/Outlines [0-9][0-9]* [0-9] R/ s/[1-9]/0/g' |
    fix-qdf > tmp.qdf
    qpdf --coalesce-contents --compression-level=9 \
         --object-streams=generate -- tmp.qdf out.pdf
    

    where:

    • 1st qpdf command converts the PDF file to QDF form for editing
    • sed orphans outlines in the QDF file by rooting them at non-existing obj 0
    • fix-qdf repairs the QDF after editing
    • 2nd qpdf converts and compresses QDF to PDF
    • qpdf's input cannot come from a pipe because qpdf needs to seek in it, hence the intermediate file tmp.qdf

    The sed command changes every nonzero digit to zero on the line containing the indented text /Outlines. GNU sed is used for its non-standard --binary option, which avoids mishaps on an OS that distinguishes between text and binary files. Similarly, to strip annotations, replace /Outlines with /Annots in the -e expression above, or add it as a second -e option to do both. Any other patching tool can be used instead of sed; often just one byte has to change.
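    To see what the substitution does without involving a real PDF, here is a minimal sketch that runs the same sed expression over a few hand-written lines. The sample lines and object numbers are hypothetical, only imitating what a QDF file might contain, and --binary is left out because the sample is plain text. It uses one -e for /Outlines and a second for /Annots.

```shell
# Hypothetical QDF-style lines; the object numbers are made up.
sample='  /Outlines 12 0 R
  /Pages 3 0 R
  /Annots 27 0 R
  /Annots [ 30 0 R ]'

# Same address/substitution pair as in the pipeline, once per key.
# --binary is left out because this sample is plain text.
patched=$(printf '%s\n' "$sample" | sed \
  -e '/^[ ][ ]*\/Outlines [0-9][0-9]* [0-9] R/ s/[1-9]/0/g' \
  -e '/^[ ][ ]*\/Annots [0-9][0-9]* [0-9] R/ s/[1-9]/0/g')

printf '%s\n' "$patched"
```

    In this sample the first and third lines become "/Outlines 00 0 R" and "/Annots 00 0 R", while the /Pages line is left alone. The patched text has the same length as the original, since digits are only swapped for zeros. The last line is also untouched: the pattern only matches a key followed directly by an indirect reference, so an /Annots entry written as an inline array would need a different expression.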

    To quickly strip all non-page data (docinfo and outlines, among others, but not annotations), qpdf's --empty option may be useful:

    qpdf --coalesce-contents --compression-level=9 \
         --object-streams=generate \
         --empty --pages in.pdf 1-z -- out.pdf