Duplicating PDF with PyPDF2 gives blank pages


I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF, and write it out, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be sufficient to preserve document metadata.

PdfFileWriter() does have a number of methods for copying an entire file: cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot. However, they all have problems.

If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file, with the correct number of pages, but all pages are blank.

If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.

This has been asked before, but with no successful answers. Other questions have asked about Blank pages in PyPDF2, but I can't apply the answer given.

Here's my code:

from PyPDF2 import PdfFileReader, PdfFileWriter

def bookmark(incomingFile):
    reader = PdfFileReader(incomingFile)
    writer = PdfFileWriter()

    writer.appendPagesFromReader(reader)
    #writer.cloneDocumentFromReader(reader)
    my_table_of_contents = [
        ('Page 1', 0),
        ('Page 2', 1),
        ('Page 3', 2),
    ]
    # Signature: writer.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in my_table_of_contents:
        writer.addBookmark(title, pagenum, parent=None)

    writer.setPageMode("/UseOutlines")

    # Note: this overwrites the source file in place.
    with open(incomingFile, "wb") as fp:
        writer.write(fp)

I tend to get errors when PyPDF2 can't add a bookmark to the PdfFileWriter object, because it doesn't have any pages, or similar.
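
One way to sidestep the bookmark errors described above is to check each table-of-contents entry against the writer's actual page count before calling addBookmark. The sketch below is plain Python and does not touch PyPDF2: `split_valid_bookmarks` is a hypothetical helper, and it assumes zero-based page indices as in the code above. In the real function, `num_pages` would presumably come from writer.getNumPages().

```python
def split_valid_bookmarks(toc, num_pages):
    """Split (title, pagenum) pairs into usable and out-of-range entries.

    Hypothetical helper: pagenum is assumed to be a zero-based page index.
    """
    valid, rejected = [], []
    for title, pagenum in toc:
        if 0 <= pagenum < num_pages:
            valid.append((title, pagenum))
        else:
            rejected.append((title, pagenum))
    return valid, rejected


toc = [("Page 1", 0), ("Page 2", 1), ("Page 3", 2)]

# A writer that ended up with no pages rejects every entry.
valid, rejected = split_valid_bookmarks(toc, 0)
print(len(valid), len(rejected))

# A two-page writer can take the first two bookmarks only.
valid, rejected = split_valid_bookmarks(toc, 2)
print(valid, rejected)
```

If every entry lands in `rejected`, the writer has no pages at all, which points at the copy step rather than at the bookmarks.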


Solution

  • I also wrestled with this a lot and finally found that PyPDF2 itself has this issue. Basically, I copied this answer's code into C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py (the path will depend on your distribution), around line 382, as the cloneDocumentFromReader function.

    After that I was able to append the reader pages to the writer with writer.cloneDocumentFromReader(pdf) and, in my case, to update PDF Metadata (Subject, Keywords, etc.).

    Hope this helps you

        '''
        Create a copy (clone) of a document from a PDF file reader
    
        :param reader: PDF file reader instance from which the clone
            should be created.
        :callback after_page_append (function): Callback function that is invoked after
            each page is appended to the writer. Signature includes a reference to the
            appended page (delegates to appendPagesFromReader). Callback signature:
    
            :param writer_pageref (PDF page reference): Reference to the page just
                appended to the document.
        '''
        debug = False
        if debug:
            print("Number of Objects: %d" % len(self._objects))
            for obj in self._objects:
                print("\tObject is %r" % obj)
                if hasattr(obj, "indirectRef") and obj.indirectRef is not None:
                    print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))
    
        # Variables used for after cloning the root to
        # improve pre- and post- cloning experience
    
        mustAddTogether = False
        newInfoRef = self._info
        oldPagesRef = self._pages
        oldPages = self.getObject(self._pages)
    
        # If there have already been any number of pages added
    
        if oldPages[NameObject("/Count")] > 0:
    
            # Keep them
    
            mustAddTogether = True
        else:
    
            # Throw the page object out
    
            if oldPages in self._objects:
                newInfoRef = self._pages
                self._objects.remove(oldPages)
    
        # Clone the reader's root document
    
        self.cloneReaderDocumentRoot(reader)
        if not self._root:
            self._root = self._addObject(self._root_object)
    
        # Sweep for all indirect references
    
        externalReferenceMap = {}
        self.stack = []
        newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)
    
        # Delete the stack to reset
    
        del self.stack
    
        #Clean-Up Time!!!
    
        # Get the new root of the PDF
    
        realRoot = self.getObject(newRootRef)
    
        # Get the new pages tree root and its ID Number
    
        tmpPages = realRoot[NameObject("/Pages")]
        newIdNumForPages = 1 + self._objects.index(tmpPages)
    
        # Make an IndirectObject just for the new Pages
    
        self._pages = IndirectObject(newIdNumForPages, 0, self)
    
        # If there are any pages to add back in
    
        if mustAddTogether:
    
            # Set the new page's root's parent to the old
            # page's root's reference
    
            tmpPages[NameObject("/Parent")] = oldPagesRef
    
            # Add the reference to the new page's root in
            # the old page's kids array
    
            newPagesRef = self._pages
            oldPages[NameObject("/Kids")].append(newPagesRef)
    
            # Set all references to the root of the old/new
            # page's root
    
            self._pages = oldPagesRef
            realRoot[NameObject("/Pages")] = oldPagesRef
    
            # Update the count attribute of the page's root
    
            oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])
    
        else:
    
            # Bump up the info's reference b/c the old
            # page's tree was bumped off
    
            self._info = newInfoRef
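
The page-tree merge in the mustAddTogether branch above can be sketched with plain dictionaries, without PyPDF2. This is a simplified model, not the library's code: `merge_page_trees` and the string "references" are hypothetical stand-ins for the indirect objects, and only the /Kids, /Count and /Parent bookkeeping is reproduced.

```python
def merge_page_trees(old_pages, old_ref, new_pages, new_ref):
    """Attach new_pages as a child subtree of old_pages.

    Hypothetical model: dicts stand in for PDF /Pages dictionaries and
    strings stand in for indirect references.
    """
    # The cloned tree's root becomes a child of the writer's existing root.
    new_pages["/Parent"] = old_ref
    old_pages["/Kids"].append(new_ref)
    # /Count is the number of leaf pages under a node, so the counts add up.
    old_pages["/Count"] += new_pages["/Count"]
    # The merged document keeps the old root as its /Pages entry.
    return old_ref


old_pages = {"/Type": "/Pages", "/Kids": ["page A"], "/Count": 1}
new_pages = {"/Type": "/Pages", "/Kids": ["page B", "page C"], "/Count": 2}

root_ref = merge_page_trees(old_pages, "old root", new_pages, "new root")
print(root_ref, old_pages["/Count"], old_pages["/Kids"])
```

In this model the old root ends up with one direct page and one nested subtree, which mirrors how the patch keeps any pages already added to the writer while pulling in the cloned document's pages.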