python csv pypdf

Appending PDF files based on multiple values in a dictionary key (or csv) results in too many pages


I am trying to generate pdf files based on the county they fall in. If there is more than one pdf file per county, I need to append the files into a single file based on the county key. I can't seem to get the maps to append based on key. The final maps generated seem random and often have way too many files appended. I am pretty sure I am not grouping them correctly. I have read that multiple values in a key can result in values showing up multiple times. Can someone please clue me in on how to access each value per key separately, one time only? Obviously I am not understanding something crucial.

My code:

import csv, os
import shutil
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter

merged_file = PdfFileMerger()
counties = {'County4': ['C:\\maps\\map2.pdf', 'C:\\maps\\map3.pdf', 'C:\\maps\\map4.pdf'], 'County1': ['C:\\maps\\map1.pdf', 'C:\\maps\\map2.pdf'], 'County3': ['C:\\maps\\map3.pdf'], 'County2': ['C:\\maps\\map1.pdf', 'C:\\maps\\map3.pdf']}
for k, v in counties.items():
    newPdfFile = 'C:\\maps\\JoinedMaps\\' + k + '.pdf'
    if len(v) > 1:
        for filename in v:
            merged_file.append(PdfFileReader(filename,'rb'))
        merged_file.write(newPdfFile)
    else:
        for filename in v:
            shutil.copyfile(filename, newPdfFile)

I get four maps outputted (which is correct) but the number of "pages" (appended files) in some of these files is wildly off. As far as I can tell there is no rhyme or reason as to how these pages are appended. County4 pdf has 3 pages (correct), County1 pdf has 8 pages instead of 2, County3 pdf has 1 page (correct) and County2 has 15 pages instead of 2.

EDIT:

It turns out PyPDF2 does not like iterating through and creating files using the concept of group-by. I imagine it has something to do with how it stores data in memory. The result is an increasingly large number of pages as you iterate through the key values. I spent days thinking it was my coding. Good to know it wasn't, I guess, but I am surprised this piece of information is not easier to find on the internet.
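The growing page counts are what one would expect if the single `merged_file` object, created once before the loop, keeps every page ever appended to it and also adds its whole page list to the output again on each `write`. That is an assumption about how the legacy merger behaves, not something confirmed here, but a toy model of it (plain Python, no PyPDF2; `ReusedMergerModel` and `simulate` are made-up names, and each map is assumed to be one page) reproduces the reported numbers exactly:

```python
class ReusedMergerModel:
    """Toy stand-in for a merger that is never reset (NOT PyPDF2 itself)."""

    def __init__(self):
        self.pages = []    # every page appended so far
        self.output = []   # every page handed to the writer so far

    def append(self, page):
        self.pages.append(page)

    def write(self):
        # Assumed behaviour: each write adds the full page list to the output again.
        self.output.extend(self.pages)
        return len(self.output)


# Same counties, in the same order as the dictionary in the question.
counties = [
    ("County4", ["map2", "map3", "map4"]),
    ("County1", ["map1", "map2"]),
    ("County3", ["map3"]),
    ("County2", ["map1", "map3"]),
]


def simulate(counties):
    merger = ReusedMergerModel()  # created once, outside the loop
    counts = {}
    for county, maps in counties:
        if len(maps) > 1:
            for page in maps:
                merger.append(page)
            counts[county] = merger.write()
        else:
            counts[county] = len(maps)  # single file is copied, not merged
    return counts


page_counts = simulate(counties)
print(page_counts)
# {'County4': 3, 'County1': 8, 'County3': 1, 'County2': 15}
```

If this reading is right, the counts are cumulative rather than random: 3, then 3 + 5 = 8, then 8 + 7 = 15.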

My solution was to use arcpy, which doesn't help most users reading this, sorry to say.
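For readers without arcpy, one thing worth trying is a brand-new merger object for every county, so that no pages can carry over from one output file to the next. The following is only a sketch under assumptions: it relies on the legacy PyPDF2 `PdfFileMerger` interface (`append`, `write`, `close`) accepting file paths, and `merge_by_county` and `merger_factory` are hypothetical names introduced for illustration.

```python
import os
import shutil


def merge_by_county(counties, out_dir, merger_factory):
    """Write one PDF per county; returns {county: output path}.

    merger_factory is any callable that returns a fresh merger object
    with append(path), write(path) and close() methods.
    """
    written = {}
    for county, paths in counties.items():
        target = os.path.join(out_dir, county + ".pdf")
        if len(paths) == 1:
            # Nothing to merge: copy the single map as-is.
            shutil.copyfile(paths[0], target)
        else:
            merger = merger_factory()  # new object per county, no leftover pages
            for path in paths:
                merger.append(path)
            merger.write(target)
            merger.close()
        written[county] = target
    return written
```

With legacy PyPDF2 installed, the call would look like `merge_by_county(counties, r'C:\maps\JoinedMaps', PdfFileMerger)` after `from PyPDF2 import PdfFileMerger`. Passing the class in, rather than importing it inside the function, is just a design choice that keeps the loop easy to check with a dummy merger.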

For those looking at my solution, my csv file looked like this:

County1   C:\maps\map1.pdf
County1   C:\maps\map2.pdf
County2   C:\maps\map1.pdf
County2   C:\maps\map3.pdf
County3   C:\maps\map3.pdf
County4   C:\maps\map2.pdf
County4   C:\maps\map3.pdf
County4   C:\maps\map4.pdf

and my resulting pdf files looked like this:

County-County1 (2 pages - Map1 and Map2)
County-County2 (2 pages - Map1 and Map3)
County-County3 (1 page - Map3)
County-County4 (3 pages - Map2, Map3, and Map4)

Solution

  • My data started out as a csv file, and the code below reads that file directly instead of the dictionary (generated from the csv file) that I used in the example above, but you should be able to glean what I did from the code. I basically scrapped the dictionary idea and went with reading the csv file line by line and then appending using arcpy. PyPDF2 did NOT merge correctly for me when trying to output multiple files based on a key. Three days of my life I can't get back.

    import csv
    import arcpy
    from arcpy import env
    import shutil, os, glob
    
    # clear out files from destination directory
    files = glob.glob(r'C:\maps\JoinedMaps\*')
    for f in files:
        os.remove(f)
    
    # open csv file (read-only)
    f = open(r"C:\maps\Maps.csv", "r")
    ff = csv.reader(f)
    
    # set variable to establish previous row of csv file (for comparison)
    pre_line = next(ff)
    
    # Iterate through csv file
    
    for cur_line in ff:
        # new file name and location based on value in column (county name)
        newPdfFile = (r'C:\maps\JoinedMaps\County-' + cur_line[0] +'.pdf')
        # establish pdf files to be appended
        joinFile = pre_line[1]
        appendFile = cur_line[1]
    
        # If columns in both rows match
        if pre_line[0] == cur_line[0]: # <-- compare first column
            # If destination file already exists, append file referenced in current row
            if os.path.exists(newPdfFile):
                tempPdfDoc = arcpy.mapping.PDFDocumentOpen(newPdfFile)
                tempPdfDoc.appendPages(appendFile)
            # Otherwise create destination and append files referenced in both the previous and current row
            else:
                tempPdfDoc = arcpy.mapping.PDFDocumentCreate(newPdfFile)
                tempPdfDoc.appendPages(joinFile)
                tempPdfDoc.appendPages(appendFile)
            # save and delete temp file
            tempPdfDoc.saveAndClose()
            del tempPdfDoc
        else:
            # if no match, do not merge, just copy
            shutil.copyfile(appendFile,newPdfFile)
    
        # reset variable
        pre_line = cur_line
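
The previous-row comparison above depends on all rows for a county sitting next to each other in the csv file. It also never writes an output for the first row if that county has only one map, which does not happen with the sample data. As a library-independent sketch of the same grouping step (Python 3, standard library only; `plan_outputs` and the in-memory `rows` list are illustrative and not part of the original script), `itertools.groupby` can collect the maps per county before any PDF tool is called:

```python
from itertools import groupby
from operator import itemgetter


def plan_outputs(rows):
    """Return [(output file name, [pdf paths])], one entry per county."""
    # groupby only merges adjacent rows, so sort by county first.
    # sorted() is stable, so maps keep their csv order within a county.
    ordered = sorted(rows, key=itemgetter(0))
    plan = []
    for county, group in groupby(ordered, key=itemgetter(0)):
        paths = [row[1] for row in group]
        plan.append(("County-" + county + ".pdf", paths))
    return plan


# Rows as csv.reader would yield them for the sample file shown earlier.
rows = [
    ["County1", r"C:\maps\map1.pdf"],
    ["County1", r"C:\maps\map2.pdf"],
    ["County2", r"C:\maps\map1.pdf"],
    ["County2", r"C:\maps\map3.pdf"],
    ["County3", r"C:\maps\map3.pdf"],
    ["County4", r"C:\maps\map2.pdf"],
    ["County4", r"C:\maps\map3.pdf"],
    ["County4", r"C:\maps\map4.pdf"],
]

for name, paths in plan_outputs(rows):
    print(name, len(paths))
# County-County1.pdf 2
# County-County2.pdf 2
# County-County3.pdf 1
# County-County4.pdf 3
```

Each `(name, paths)` pair can then be handed to whichever tool does the appending, with a copy for one-element lists and a merge otherwise.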