Tags: python, pandas, pdf, pypdf

Keep selected pages from PDF


I have a pandas dataframe, pdf_summary, which is sorted and has 50 unique rows. Each row pairs a file_name with the pages of menu.pdf to keep (file_pages_to_keep). How can I create a folder for each file_name that contains one PDF with only those pages?

pdf_path = "Documents/menu.pdf"
pdf_summary

            file_name             file_pages_to_keep
  1 - Monday, Wednesday                1,3
  2 - Monday                            1
  3 - Monday, Tuesday, Wednesday       1,2,3
...
  50 - Friday                           5 

The expected output is 50 folders, each containing one PDF made up of only the listed pages from menu.pdf.

"Documents/1 - Monday, Wednesday/1 - Monday, Wednesday.pdf" (PDF only has pages 1 and 3 from menu.pdf)
...

Solution

  • First, define a function that copies the selected pages of a PDF into a new file:

    import os
    from pypdf import PdfReader, PdfWriter  # pypdf is the maintained successor to the deprecated PyPDF2
    
    def extract_pages(input_pdf, output_pdf, pages):
        with open(input_pdf, "rb") as file:
            reader = PdfReader(file)
            writer = PdfWriter()
            for page_num in pages:
                writer.add_page(reader.pages[page_num - 1])  # dataframe pages are 1-based; reader.pages is 0-indexed
            with open(output_pdf, "wb") as output_file:
                writer.write(output_file)
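
Because the function subtracts 1 from every page number, a stray 0 in the dataframe would become index -1, which Python sequences normally treat as "the last element" rather than an error. A minimal sketch of a guard, assuming you know the page count of the source PDF; the helper name `to_zero_based` is made up for illustration and is not part of pypdf:

```python
def to_zero_based(pages, page_count):
    """Convert 1-based page numbers to 0-based indices, rejecting out-of-range values."""
    indices = []
    for page_num in pages:
        # Reject 0, negatives, and anything past the last page
        if not 1 <= page_num <= page_count:
            raise ValueError(f"page {page_num} is outside 1..{page_count}")
        indices.append(page_num - 1)
    return indices


print(to_zero_based([1, 3], page_count=5))  # [0, 2]
```

Inside extract_pages, this could be called with len(reader.pages) as the page count before the loop adds pages to the writer.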
    

    Then iterate over the rows of the dataframe. For each row, create a folder named after the file_name column, parse the list of pages to keep, and write the new PDF into that folder:

    for index, row in pdf_summary.iterrows():
        # Create a folder with the file_name if it doesn't exist
        folder_name = row['file_name']
        folder_path = os.path.join("Documents", folder_name)  # matches the expected "Documents/<file_name>/" layout
        os.makedirs(folder_path, exist_ok=True)
    
        # Extract pages to keep from the PDF
        # str() also covers a cell that pandas stored as a single integer instead of a string
        file_pages_to_keep = [int(page) for page in str(row['file_pages_to_keep']).split(',')]
        output_pdf_path = os.path.join(folder_path, f"{folder_name}.pdf")
    
        # Create a new PDF with the specified pages
        extract_pages(pdf_path, output_pdf_path, file_pages_to_keep)
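
The page-parsing step in the loop can also be pulled out into a small helper so that stray spaces, trailing commas, or a cell stored as an integer do not break the run. This is only a sketch under the assumption that file_pages_to_keep always holds comma-separated whole numbers; `parse_page_list` is a hypothetical name, not a pandas or pypdf function:

```python
def parse_page_list(value):
    """Turn '1,3', ' 5 ', or the integer 5 into a list of 1-based page numbers."""
    # str() lets the helper accept a cell that pandas stored as an int
    parts = str(value).split(",")
    # int() ignores surrounding whitespace; empty pieces (e.g. from '1,,3') are skipped
    return [int(part) for part in parts if part.strip()]


print(parse_page_list("1,2,3"))  # [1, 2, 3]
```

In the loop, the list comprehension could then be replaced with parse_page_list(row['file_pages_to_keep']).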