
Splitting PDF with PyPDF2 in Lambda function


I have the following Lambda function, which is meant to split an uploaded PDF into individual pages. When I upload an 8-page PDF, it creates 8 identical copies of the original PDF instead of 8 single-page files.

I must be doing something stupid, but I'm not sure what. Help!

import boto3
from PyPDF2 import PdfReader, PdfWriter

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Retrieve the uploaded file details from the event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    file_name = file_key.split('/')[-1]  # Extract the original file name

    # Prepare the output directory path
    output_dir = 'PCP/temp/'  # Specify your desired output directory
    output_prefix = file_name.split('.')[0] + '-'  # Prefix for split file names

    # Download the uploaded file to temp storage
    temp_file_path = '/tmp/' + file_name
    s3.download_file(bucket_name, file_key, temp_file_path)

    # Read the uploaded PDF file
    pdf = PdfReader(temp_file_path)

    # Split the PDF into individual pages and save them
    for page_number in range(len(pdf.pages)):
        print (f"Page {page_number}")
        temp_output_path = f"/tmp/{output_prefix}{page_number + 1}.pdf"
        output_page_path = f"{output_dir}{output_prefix}{page_number + 1}.pdf"
        output_pdf = PdfWriter()
        output_pdf.add_page(pdf.pages[page_number])

        with open(temp_output_path, 'wb') as output_file:
            output_pdf.write(output_file)
        
        # Upload the split page to S3 bucket
        s3.upload_file(temp_file_path, bucket_name, output_page_path)

    return {
        'statusCode': 200,
        'body': 'PDF splitting completed successfully.'
    }

Solution

  • When you call s3.upload_file, you pass temp_file_path, which points to the original downloaded file, rather than temp_output_path, which is where you wrote the current page inside the for loop. As a result, every iteration uploads the full original PDF under a different key.

    I recommend using more descriptive variable names to avoid mistakes like this, which are easy to miss when the names are similar and generic. Consider renaming temp_file_path to downloaded_pdf_path and temp_output_path to current_page_path.
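
For illustration, here is a minimal sketch of the loop with that fix and the suggested renaming applied. The helper names page_paths and split_and_upload are hypothetical (they are not in the question), and the S3 client is passed in as an argument, which is an assumption made so the function can be exercised without real AWS access. The PyPDF2 import is kept inside the function so the path helper can be used on its own.

```python
def page_paths(file_name, page_index, output_dir="PCP/temp/", tmp_dir="/tmp"):
    """Return (current_page_path, s3_key) for one page; page_index is 0-based."""
    stem = file_name.split(".")[0]
    page_file = f"{stem}-{page_index + 1}.pdf"
    return f"{tmp_dir}/{page_file}", f"{output_dir}{page_file}"


def split_and_upload(s3_client, bucket_name, downloaded_pdf_path, output_dir="PCP/temp/"):
    """Write each page to its own file in /tmp and upload that file to S3."""
    # Imported here so page_paths() works even where PyPDF2 is not installed.
    from PyPDF2 import PdfReader, PdfWriter

    file_name = downloaded_pdf_path.split("/")[-1]
    reader = PdfReader(downloaded_pdf_path)
    uploaded_keys = []

    for page_index, page in enumerate(reader.pages):
        current_page_path, s3_key = page_paths(file_name, page_index, output_dir)

        # A fresh writer per page, so each output file holds exactly one page.
        writer = PdfWriter()
        writer.add_page(page)
        with open(current_page_path, "wb") as output_file:
            writer.write(output_file)

        # The fix: upload the page that was just written,
        # not the original downloaded PDF.
        s3_client.upload_file(current_page_path, bucket_name, s3_key)
        uploaded_keys.append(s3_key)

    return uploaded_keys
```

Inside lambda_handler this would be called after s3.download_file, for example as split_and_upload(s3, bucket_name, downloaded_pdf_path). With distinct names for the downloaded file and the current page, passing the wrong path to upload_file becomes much harder to overlook.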