Tags: python, pdf, pypdf, pdfrw

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks


I'm trying to automate merging several PDF files and have two requirements: a) existing bookmarks AND b) pagelabels (custom page numbering) need to be retained.

Retaining bookmarks when merging happens by default with PyPDF2 and pdftk, but not with pdfrw. Pagelabels are consistently not retained in PyPDF2, pdftk or pdfrw.

I am guessing, after having searched a lot, that there is no straightforward approach to doing what I want. If I'm wrong then I hope someone can point to this easy solution. But, if there is no easy solution, any tips on how to get this going in python will be much appreciated!

Some example code:

1) With PyPDF2

from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
# PdfFileReader accepts a path; its second positional argument is `strict`, not a file mode
tmp1 = PdfFileReader('file1.pdf')
tmp2 = PdfFileReader('file2.pdf')
# extracting pagelabels is easy
pl1 = tmp1.trailer['/Root']['/PageLabels']
pl2 = tmp2.trailer['/Root']['/PageLabels']
# but, from what I understand, neither PdfFileWriter nor PdfFileMerger supports writing them

So I don't know how to proceed from here.

2) With pdfrw (has more promise)

from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
#read 1st file
tmp1 = PdfReader('file1')
#add the pages
writer.addpages(tmp1.pages)
#copy bookmarks to writer
writer.trailer.Root.Outlines = tmp1.Root.Outlines
#copy pagelabels to writer
writer.trailer.Root.PageLabels = tmp1.Root.PageLabels
#read second file
tmp2 = PdfReader('file2')
#append pages
writer.addpages(tmp2.pages)
# so far so good

The page numbers of the bookmarks from the 2nd file need to be offset before adding them, but when reading the outlines I almost always get (IndirectObject, XXX) instead of page numbers. It's unclear how to get the page number for each label and bookmark using pdfrw. So I'm stuck again.

zp


Solution

  • You need to iterate through the existing PageLabels and add them to the merged output, taking care to add an offset to the page index entry, based on the number of pages already added.

    This solution also requires PyPDF4, since the same code fails under PyPDF2 with a ValueError (see the traceback at the bottom).

    from PyPDF4 import PdfFileWriter, PdfFileMerger, PdfFileReader 
    
    # To manipulate the PDF dictionary
    import PyPDF4.pdf as PDF
    
    import logging
    
    def add_nums(num_entry, page_offset, nums_array):
        # Assumes a flat /Nums array (page index, label dict, page index, ...);
        # a label tree split across /Kids is not handled here
        for num in num_entry['/Nums']:
            if isinstance(num, int):
                logging.debug("Found page index %s, offset %s", num, page_offset)
    
                # Shift the physical page index by the pages already merged
                nums_array.append(PDF.NumberObject(num + page_offset))
            else:
                # e.g. {'/S': '/r'} or {'/S': '/D', '/St': 489}
                keys = num.keys()
                logging.debug("Found page label, keys: %s", keys)
                number_type = PDF.DictionaryObject()
    
                if '/S' in keys:
                    # Numbering style (/D, /r, /R, /a, /A); optional per the PDF spec
                    s_entry = num['/S']
                    number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
                    logging.debug("Adding /S entry: %s", s_entry)
    
                if '/P' in keys:
                    # Label prefix; already a PDF string object, so copy it as-is
                    number_type.update({PDF.NameObject("/P"): num['/P']})
    
                if '/St' in keys:
                    # /St is the first label value of the range, not a page index,
                    # so it is copied unchanged (no page offset is added)
                    pdf_label_offset = num['/St']
                    logging.debug("Found /St %s", pdf_label_offset)
                    number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})
    
                # Add the label information
                nums_array.append(number_type)
    
        return nums_array
    
    def write_merged(pdf_readers):
        # Output
        merger = PdfFileMerger()
    
        # For PageLabels information
        page_offset = 0
        nums_array = PDF.ArrayObject()
    
        # Iterate through all the inputs
        for pdf_reader in pdf_readers:
            # Merge the content
            merger.append(pdf_reader)
            page_count = pdf_reader.getNumPages()
    
            try:
                # Fetch the existing PageLabels and add them, shifted by the offset
                old_page_labels = pdf_reader.trailer['/Root']['/PageLabels']
                add_nums(old_page_labels, page_offset, nums_array)
            except KeyError as err:
                # No /PageLabels (or no /Nums) in this input: its pages simply
                # continue the previous label range
                logging.warning("No page labels copied: missing key %s", err)
    
            # Always advance the offset, even when an input has no labels,
            # so that later inputs still land on the right pages
            page_offset = page_offset + page_count
    
        # Add PageLabels
        page_numbers = PDF.DictionaryObject()
        page_numbers.update({PDF.NameObject("/Nums"): nums_array})
    
        page_labels = PDF.DictionaryObject()
        page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
    
        # _root_object is a private attribute of the underlying writer
        root_obj = merger.output._root_object
        root_obj.update(page_labels)
    
        # Write output
        merger.write('merged.pdf')
    
    
    # The second positional argument of PdfFileReader is `strict`, not a file mode,
    # so only the path is passed
    pdf_readers = [PdfFileReader('file1.pdf'), PdfFileReader('file2.pdf')]
    
    write_merged(pdf_readers)
    

    Note: with PyPDF2, the same code fails with this error:

      ...
      ...
      File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
        data[key] = value
      File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
        raise ValueError("key must be PdfObject")
    ValueError: key must be PdfObject
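
To make the offset logic easier to follow without a PDF at hand, here is a minimal sketch of the same idea on plain Python data. It assumes each input is described by a hypothetical (page_count, nums) pair, where nums mimics a flat /Nums array of alternating page indices and label dictionaries. The function name merge_nums and the sample values are made up for illustration and are not part of PyPDF4.

```python
def merge_nums(documents):
    """Merge flat /Nums-style lists, shifting page indices per document.

    documents: list of (page_count, nums) pairs (hypothetical layout);
    nums alternates page index, label dict, page index, label dict, ...
    """
    merged = []
    page_offset = 0
    for page_count, nums in documents:
        # Walk the flat array two items at a time: (page index, label dict)
        for index, label in zip(nums[0::2], nums[1::2]):
            # Only the physical page index moves
            merged.append(index + page_offset)
            # The label dict is copied unchanged; /St is a label start value,
            # not a page index, so it gets no offset
            merged.append(dict(label))
        # Pages contributed by this document shift everything that follows
        page_offset += page_count
    return merged


# Sample data (made up): a 4-page file with roman then decimal labels,
# followed by a 3-page file whose labels start at 489
doc1 = (4, [0, {"/S": "/r"}, 2, {"/S": "/D"}])
doc2 = (3, [0, {"/S": "/D", "/St": 489}])

merged = merge_nums([doc1, doc2])
print(merged)
```

The second file's range starts at index 4 in the merged array because the first file contributed four pages, while its /St value of 489 is untouched.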