I'm probably doing something really stupid here, but I've got the following Lambda function to split an uploaded PDF into individual pages. When I upload an 8-page PDF, it creates 8 identical copies of the original PDF instead of 8 single-page files. I must be missing something obvious, but I can't see what. Help!
import boto3
from PyPDF2 import PdfReader, PdfWriter

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Retrieve the uploaded file details from the event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    file_name = file_key.split('/')[-1]  # Extract the original file name

    # Prepare the output directory path
    output_dir = 'PCP/temp/'  # Specify your desired output directory
    output_prefix = file_name.split('.')[0] + '-'  # Prefix for split file names

    # Download the uploaded file to temp storage
    temp_file_path = '/tmp/' + file_name
    s3.download_file(bucket_name, file_key, temp_file_path)

    # Read the uploaded PDF file
    pdf = PdfReader(temp_file_path)

    # Split the PDF into individual pages and save them
    for page_number in range(len(pdf.pages)):
        print(f"Page {page_number}")
        temp_output_path = f"/tmp/{output_prefix}{page_number + 1}.pdf"
        output_page_path = f"{output_dir}{output_prefix}{page_number + 1}.pdf"

        output_pdf = PdfWriter()
        output_pdf.add_page(pdf.pages[page_number])

        with open(temp_output_path, 'wb') as output_file:
            output_pdf.write(output_file)

        # Upload the split page to S3 bucket
        s3.upload_file(temp_file_path, bucket_name, output_page_path)

    return {
        'statusCode': 200,
        'body': 'PDF splitting completed successfully.'
    }
When you call s3.upload_file, you are passing temp_file_path, which references the original downloaded file, rather than temp_output_path, which is where you wrote the current page inside the for loop. That is why every uploaded file is an identical copy of the whole PDF.

I recommend using more descriptive variable names to help avoid issues like this, which are easy to miss when the names are similar and generic. Consider renaming temp_file_path to downloaded_pdf_path and temp_output_path to current_page_path.
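To make the fix concrete, here is a minimal sketch of the loop's upload logic with the renamed variables and no real S3 or PyPDF2 calls (the upload callback, the page count, and the file names are stand-ins for illustration). It shows that, once you pass the per-page path, each iteration uploads a distinct file rather than the original N times:

```python
def split_and_upload(page_count, upload):
    """Sketch of the per-page upload step; `upload` stands in for s3.upload_file."""
    downloaded_pdf_path = "/tmp/original.pdf"  # the full PDF you downloaded
    for page_number in range(page_count):
        # Path where this iteration's single-page PDF was written
        current_page_path = f"/tmp/original-{page_number + 1}.pdf"
        # Buggy version: upload(downloaded_pdf_path, ...) re-uploads the
        # whole original PDF every iteration.
        # Fixed version: upload the single-page file you just wrote.
        upload(current_page_path, f"PCP/temp/original-{page_number + 1}.pdf")

uploads = []
split_and_upload(3, lambda src, dest: uploads.append(src))
# uploads now holds three distinct page paths, not the original path three times
```

In your handler, the one-line fix is to change the first argument of s3.upload_file from temp_file_path to temp_output_path.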