
Using PdfFileMerger in Python to merge PDFs with the same name, but different numbers


I have a directory full of individual PDFs that need to be merged together based on their names. Each PDF has one page. The naming convention for each file consists of a string name and a number. This is roughly what my directory looks like:

A_001.pdf
A_002.pdf
A_003.pdf
B_001.pdf
B_002.pdf
B_003.pdf
B_004.pdf

I basically need one PDF for A (it would have 3 pages) and one PDF for B (it would have 4 pages). The _001 and so forth should be the page number. My current Python script does output A.pdf and B.pdf, but each one includes pages from both A and B.

import PyPDF2, os
from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
from pathlib import Path

single_file_dir = r'Y:\Python\Single_PDFs'
binder_file_dir = r'Y:\Python\Combined_PDFs'

# get list of all files in the single PDF directory
single_file_list = []
for file in os.listdir(single_file_dir):
    if file.endswith(".pdf"):
        single_file_list.append(single_file_dir + "\\" + file)

print(single_file_list)

# get the file names for the output multi page pdfs

file_name_list = []
for file in single_file_list:
    name = os.path.basename(file)
    new_name = name[:-8]
    file_name_list.append(new_name)
    unique_file_name_list = list(set(file_name_list))

merger = PdfFileMerger()

print(unique_file_name_list)

#try to match input single file name to output file name
for file in single_file_list:
    for name in unique_file_name_list:
        if name in file:
            merger.append(file)
            merger.write(binder_file_dir + "\\" + name + ".pdf")

This script does result in A.pdf and B.pdf, but both output PDFs include many duplicates of both the A single PDFs and the B single PDFs. My goal is to have A_001.pdf, A_002.pdf, A_003.pdf merged into one multi-page pdf. Same with the B series PDFs.


Solution

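  • I think your problem may be coming from reusing your pdf merger. A single merger keeps every file that has been appended to it, so each later write also contains the pages added for earlier files. Creating a new merger for each output file avoids that.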

    This code is adapted from another script I use to merge pdfs. Let me know if it works for you.

    from collections import defaultdict
    from pathlib import Path
    
    from PyPDF2 import PdfMerger
    
    single_file_dir = Path("Y:/") / "Python" / "Single_PDFs"
    binder_file_dir = Path("Y:/") / "Python" / "Combined_PDFs"
    
    file_groups: defaultdict[str, list[Path]] = defaultdict(list)
    for file in single_file_dir.glob("*.pdf"):
        group = file.stem.rsplit("_", 1)[0]  # "A_001" -> "A"; adjust to however you want to determine the group
        file_groups[group].append(file)
    
    for group, files in file_groups.items():
        merger = PdfMerger()
        for file in sorted(files):
            merger.append(str(file))
    
        with open(binder_file_dir / f"{group}.pdf", "wb") as binder:
            merger.write(binder)
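
The grouping key in the loop above can also be derived with a regular expression, which additionally lets the pages be ordered by their numeric suffix instead of by plain string sort (that matters if the numbers are not zero-padded). This is only a sketch: it assumes every file is named `<name>_<digits>.pdf`, and `NAME_PATTERN` and `group_pages` are hypothetical names that are not part of PyPDF2.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: "<name>_<digits>.pdf", e.g. "A_001.pdf".
NAME_PATTERN = re.compile(r"^(?P<group>.+)_(?P<page>\d+)$")


def group_pages(paths):
    """Map each group name to its files, ordered by numeric page suffix."""
    groups = defaultdict(list)
    for path in paths:
        path = Path(path)
        match = NAME_PATTERN.match(path.stem)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        groups[match["group"]].append((int(match["page"]), path))

    # Sort by the integer page number only, then drop the number.
    return {
        group: [path for _, path in sorted(pages, key=lambda item: item[0])]
        for group, pages in groups.items()
    }


example = group_pages(
    ["B_002.pdf", "A_010.pdf", "A_2.pdf", "notes.pdf", "B_001.pdf"]
)
```

Each list in the result could then be passed to its own merger, in order, in place of `sorted(files)`.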
    

    Notes:

    I like using the pathlib module to avoid dealing with the platform-specific idiosyncrasies of paths (especially backslashes on Windows).
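
To see why one shared merger mixes the outputs, here is a small simulation that needs no PDF library. Plain lists stand in for the merger and for the written files, and both function names are made up for this illustration. It assumes the files are visited in sorted order and simplifies the name matching to a prefix check.

```python
def merge_with_shared_accumulator(files, groups):
    """Mimic the question's loop: one accumulator shared by every group."""
    shared = []  # stand-in for a single merger object
    outputs = {}
    for file in files:
        for name in groups:
            if file.startswith(name + "_"):
                shared.append(file)
                # stand-in for writing the merger to "<name>.pdf"
                outputs[name] = list(shared)
    return outputs


def merge_with_fresh_accumulator(files, groups):
    """Mimic the fix: a new accumulator for each group."""
    outputs = {}
    for name in groups:
        fresh = []  # stand-in for a new merger per output file
        for file in sorted(files):
            if file.startswith(name + "_"):
                fresh.append(file)
        outputs[name] = fresh
    return outputs


files = ["A_001.pdf", "A_002.pdf", "B_001.pdf"]
shared_result = merge_with_shared_accumulator(files, ["A", "B"])
fresh_result = merge_with_fresh_accumulator(files, ["A", "B"])
```

In the shared version, the output for B also carries the A files that were appended earlier; in the per-group version it holds only B_001.pdf.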