
Merging PDFs with PyPDF2, with inputs based on a file iterator


I have two folders containing PDFs with identical file names. I want to iterate through the first folder, take the first 3 characters of each filename as the 'current' page name, use that value to grab the 2 corresponding PDFs from both folders, merge them, and write the result to a third folder.

The script below works as expected for the first iteration, but after that each merged PDF also includes all the previous ones (ballooning quickly to 72 pages within 8 iterations).

Some of this could be due to poor code, but I can't figure out where the problem is, or how to clear the inputs/outputs so that only 2 pages are written per iteration:

import os
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()

rootdir = 'D:/Python/Scatterplots/BoundaryEnrollmentPatternMap'

for subdir, dirs, files in os.walk(rootdir):
    for currentPDF in files:
        #print os.path.join(file[0:3])
        pagename = os.path.join(currentPDF[0:3])
        print "pagename is: " + pagename
        print "File is: " + pagename + ".pdf"
        input1temp = 'D:/Python/Scatterplots/BoundaryEnrollmentPatternMap/' + pagename + '.pdf'
        input2temp = 'D:/Python/Scatterplots/TraditionalScatter/' + pagename + '.pdf'
        input1 = open(input1temp, "rb")
        input2 = open(input2temp, "rb")
        merger.append(fileobj=input1, pages=(0,1))
        merger.append(fileobj=input2, pages=(0,1))
        outputfile = 'D:/Python/Scatterplots/CombinedMaps/Sch_' + pagename + '.pdf'

        print merger.inputs

        output = open(outputfile, "wb")
        merger.write(output)
        output.close()

        #clear all inputs - necessary?
        outputfile = []
        output = []
        merger.inputs = []
        input1temp = []
        input2temp = []
        input1 = []
        input2 = []

print "done"

My code / work is based on this sample:

https://github.com/mstamy2/PyPDF2/blob/master/Sample_Code/basic_merging.py


Solution

  • I think the error is that merger is created once, before the loop, so it accumulates the pages from every iteration. Try moving the line merger = PdfFileMerger() into the loop body. Resetting merger.inputs = [] doesn't seem to help in this case, presumably because the appended pages are also held elsewhere inside the merger.

    There are a few notes about your code:

    • input1 = [] doesn't close the file; it only rebinds the name. This can leave many files open in the program. Call input1.close() instead.

    • [] is an empty list. It is better to use None if a variable should not hold any meaningful value.

    • To remove a variable (e.g. output), use del output.

    • Ultimately, clearing all the variables is not necessary. They will be freed by the garbage collector.

    • Use os.path.join to create input1temp and input2temp.
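
Putting these notes together, here is a minimal sketch of how the loop could look. It is not the asker's final code: the names merge_pairs, make_merger and pypdf2_merger are made up for this example, and it assumes every file in the first folder is named with a 3-character page name plus ".pdf" and has a counterpart in the second folder. The merger is created inside the loop, the files are closed by with blocks, and the paths are built with os.path.join. It should run under both Python 2.7 and Python 3.

```python
import os


def merge_pairs(dir_a, dir_b, out_dir, make_merger):
    """Merge page 1 of each same-named PDF from dir_a and dir_b.

    make_merger is called once per pair, so every output file starts
    from an empty merger and nothing is carried over between iterations.
    """
    written = []
    for name in sorted(os.listdir(dir_a)):
        if not name.lower().endswith(".pdf"):
            continue
        pagename = name[0:3]  # assumption: file names look like "123.pdf"
        path_a = os.path.join(dir_a, pagename + ".pdf")
        path_b = os.path.join(dir_b, pagename + ".pdf")
        out_path = os.path.join(out_dir, "Sch_" + pagename + ".pdf")

        merger = make_merger()  # fresh merger for this pair only
        # Keep the inputs open until write() has finished, then let the
        # with blocks close them instead of rebinding the names to [].
        with open(path_a, "rb") as input1, open(path_b, "rb") as input2:
            merger.append(fileobj=input1, pages=(0, 1))
            merger.append(fileobj=input2, pages=(0, 1))
            with open(out_path, "wb") as output:
                merger.write(output)
        written.append(out_path)
    return written


def pypdf2_merger():
    # Imported lazily so merge_pairs itself does not depend on PyPDF2.
    # Assumption: an older PyPDF2 release that still provides
    # PdfFileMerger, as in the question.
    from PyPDF2 import PdfFileMerger
    return PdfFileMerger()
```

With the folders from the question, the call would be merge_pairs('D:/Python/Scatterplots/BoundaryEnrollmentPatternMap', 'D:/Python/Scatterplots/TraditionalScatter', 'D:/Python/Scatterplots/CombinedMaps', pypdf2_merger). Passing the merger factory as an argument is only a convenience for this sketch; calling PdfFileMerger() directly at the top of the loop body has the same effect.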