
How can I drop metadata fields (e.g., PageLabel fields) from PDFs?


I have used pdftk to change the "Info" metadata associated with a PDF. I currently have several PDFs with extraneous page labels, and I cannot figure out how to drop them. This is what I am currently doing:

$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf

This does not remove the PageLabel* metadata, which can be verified by dumping the data of the new file:

$ pdftk example_new.pdf dump_data | grep PageLabel

How can I programmatically remove this metadata from the PDF? It would be nice to do this with pdftk, but if there is another tool or way to do this on GNU/Linux, that would also work for me.

I need this because I am using LaTeX Beamer to generate presentations with the \setbeameroption{show notes on second screen} option, which generates a double-width PDF for showing notes on a second screen. Unfortunately, there seems to be a bug in pgfpages which results in incorrect and extraneous PageLabels in these files (example). If I generate a slides-only PDF, it gets the correct PageLabels (example). Since I can generate a correct set of PageLabels, one solution would be to replace the PageLabels in the first example with those in the second. That said, since there are extra PageLabels in the first example, I would need to remove them first.


Solution

  • I am not sure I understood the problem correctly, but you can try a butcher's solution: brute-force rename the /PageLabels key to a different one that PDF readers will not recognize, so the page labels are ignored.

    # Get a readable/writable PDF
    pdftk file1.pdf output temp.pdf uncompress
    
    # Mangle the PDF: rename the /PageLabels key.
    # The replacement has the same length, so the file size does not change.
    sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
    
    # Recompress
    pdftk mangled.pdf output final.pdf compress
    
    rm -f temp.pdf mangled.pdf
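
The "keep same length" point can be checked on a toy input. The sketch below is not part of the original answer: `mangle_key`, `sample.txt`, and `sample_mangled.txt` are hypothetical names, and the sample is a hand-written stand-in for an uncompressed catalog object, not a real PDF. It applies the same sed expression and confirms that the byte count is unchanged and the original key is gone.

```shell
#!/bin/sh
# Hypothetical helper: rename the /PageLabels key in an uncompressed file
# and succeed only if the output has the same byte count as the input.
mangle_key() {
    # $1 = input file, $2 = output file
    sed -e 's|^/PageLabels|/BageLapels|' < "$1" > "$2"
    [ $(wc -c < "$1") -eq $(wc -c < "$2") ]
}

# Tiny stand-in for a catalog object as written by "pdftk ... uncompress"
# (assumed layout for illustration; not a complete PDF).
printf '1 0 obj\n<<\n/PageLabels 5 0 R\n/Type /Catalog\n>>\nendobj\n' > sample.txt

mangle_key sample.txt sample_mangled.txt && echo "same length"

# Count lines that still start with the original key (expected: 0).
grep -c '^/PageLabels' sample_mangled.txt || true
```

On a real file, the same check could be run between the `sed` step and the recompress step, before deleting the temporary files.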