Using PdfFileMerger in Python to merge PDFs with the same name, but different numbers

I have a directory full of individual PDFs that need to be merged together, based on their name. Each individual pdf file has one page. The naming convention for each file consists of a string name and a number. This is roughly what my directory looks like:

A_001.pdf A_002.pdf A_003.pdf B_001.pdf B_002.pdf B_003.pdf B_004.pdf

I need one PDF for A (3 pages) and one PDF for B (4 pages). The numeric suffix (_001 and so on) should determine the page order. My current Python script does output A.pdf and B.pdf, but each one includes pages from both A and B.

import PyPDF2, os
from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
from pathlib import Path

single_file_dir = r'Y:\Python\Single_PDFs'
binder_file_dir = r'Y:\Python\Combined_PDFs'

# get list of all files in the single PDF directory
single_file_list = []
for file in os.listdir(single_file_dir):
    if file.endswith(".pdf"):
        single_file_list.append(single_file_dir + "\\" + file)

print(single_file_list)

# get the file names for the output multi page pdfs

file_name_list = []
for file in single_file_list:
    name = os.path.basename(file)
    new_name = name[:-8]
    file_name_list.append(new_name)
    unique_file_name_list = list(set(file_name_list))

merger = PdfFileMerger()

print(unique_file_name_list)

#try to match input single file name to output file name
for file in single_file_list:
    for name in unique_file_name_list:
        if name in file:
            merger.append(file)
            merger.write(binder_file_dir + "\\" + name + ".pdf")

This script does result in A.pdf and B.pdf, but both output PDFs include many duplicates of both the A single PDFs and the B single PDFs. My goal is to have A_001.pdf, A_002.pdf, A_003.pdf merged into one multi-page pdf. Same with the B series PDFs.

Solution

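I think your problem comes from reusing your PDF merger. A single merger is created before the loops, so every append adds to the same growing document, and each write saves everything appended so far, whichever name happened to match. Creating a fresh merger for each group avoids this.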
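
To make the accumulation visible without touching any PDFs, here is a minimal sketch that uses a stand-in class. FakeMerger is hypothetical and is not part of PyPDF2; it only records which files were appended and what each write would have contained.

```python
class FakeMerger:
    """Stand-in for a PDF merger: remembers every file appended to it."""

    def __init__(self):
        self.pages = []

    def append(self, filename):
        self.pages.append(filename)

    def write(self, outputs, target):
        # Record a snapshot of everything appended so far under the target name.
        outputs[target] = list(self.pages)


files = ["A_001.pdf", "A_002.pdf", "B_001.pdf"]

# One shared merger, as in the question: later outputs inherit earlier files.
shared_outputs = {}
shared = FakeMerger()
for f in files:
    shared.append(f)
    shared.write(shared_outputs, f[0] + ".pdf")

# One merger per group: each output holds only its own files.
separate_outputs = {}
for group in ("A", "B"):
    merger = FakeMerger()
    for f in files:
        if f.startswith(group + "_"):
            merger.append(f)
    merger.write(separate_outputs, group + ".pdf")

print(shared_outputs["B.pdf"])    # contains the A files as well
print(separate_outputs["B.pdf"])  # contains only B_001.pdf
```

With the shared merger, B.pdf ends up holding the A files too, which matches the symptom described in the question.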

The code below is adapted from another script I use to merge PDFs. It uses PdfMerger, the name recent PyPDF2 releases use for PdfFileMerger. Let me know if it works for you.

from collections import defaultdict
from pathlib import Path

from PyPDF2 import PdfMerger

single_file_dir = Path("Y:/") / "Python" / "Single_PDFs"
binder_file_dir = Path("Y:/") / "Python" / "Combined_PDFs"

file_groups: defaultdict[str, list[Path]] = defaultdict(list)
for file in single_file_dir.glob("*.pdf"):
    group = file.stem.rsplit("_", 1)[0]  # "A_001" -> "A"; adjust to match your naming convention
    file_groups[group].append(file)

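for group, files in file_groups.items():
    merger = PdfMerger()  # a fresh merger per group, so pages never carry over
    for file in sorted(files):  # zero-padded suffixes sort into page order
        merger.append(str(file))

    with open(binder_file_dir / f"{group}.pdf", "wb") as binder:
        merger.write(binder)
    merger.close()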
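
If you want to check the grouping before writing any files, the grouping step can be pulled out into a small function. This is only a sketch, assuming every filename follows the name_number.pdf pattern from the question. The helper name group_pages is made up for illustration. It also sorts by the numeric value of the suffix, so it still works if the numbers are not zero-padded.

```python
from collections import defaultdict
from pathlib import PurePath


def group_pages(filenames):
    """Map each name prefix to its files, ordered by numeric page suffix."""
    groups = defaultdict(list)
    for filename in filenames:
        stem = PurePath(filename).stem        # "A_001.pdf" -> "A_001"
        name, _, page = stem.rpartition("_")  # -> ("A", "_", "001")
        if not name or not page.isdigit():
            continue  # skip files that do not match name_number.pdf
        groups[name].append((int(page), filename))
    # Sort pages by their integer page number, then drop the number.
    return {
        name: [f for _, f in sorted(pages)]
        for name, pages in sorted(groups.items())
    }


example = group_pages(["B_002.pdf", "A_10.pdf", "A_2.pdf", "B_001.pdf", "notes.pdf"])
print(example)
```

Each list in the result can then be handed to its own merger in order.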

Notes:

I like using the pathlib module because it avoids the platform-specific idiosyncrasies of paths (especially backslashes on Windows).