I have a document library of several hundred PDF documents, and I am trying to export the first page of each one. Below is my script, which extracts the page and saves it as an individual PDF. However, the exported files are unreadable or damaged.
Is there something missing from my script?
import os
from PyPDF2 import PdfReader, PdfWriter

# get the file names in the directory
input_directory = "Fund_Docs_Sample"
entries = os.listdir(input_directory)
output_directory = "First Pages"
outputs = os.listdir(output_directory)

for output_file_name in entries:
    reader = PdfReader(input_directory + "/" + output_file_name)
    page = reader.pages[0]
    first_page = "\n" + page.extract_text() + "\n"
    with open(output_file_name, "wb") as outputStream:
        pdf_writer = PdfWriter(output_file_name + first_page)
        pdf_writer.write(outputStream)
output_directory is not used at all.
After reading the comments, you likely want this:
from pathlib import Path
from PyPDF2 import PdfReader

# get the file names in the directory
input_directory = Path("Fund_Docs_Sample")
output_directory = Path("First Pages")

for input_file_path in input_directory.glob("*.pdf"):
    print(input_file_path)
    reader = PdfReader(input_file_path)
    page = reader.pages[0]
    first_page_text = "\n" + page.extract_text() + "\n"
    # create the output text file path
    output_file_path = output_directory / f"{input_file_path.name}.txt"
    # write the text to the output file
    with open(output_file_path, "w") as output_file:
        output_file.write(first_page_text)
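
If you do want the first page as a PDF rather than as text, as the question originally describes, something along these lines might work. The script in the question never hands a page object to the writer (the extracted text is only a string), which would explain the damaged files. This is a minimal sketch: it assumes a PyPDF2 version that provides PdfWriter.add_page, and the helper names first_page_path and export_first_page as well as the "_page1" suffix are made up for illustration.

```python
from pathlib import Path


def first_page_path(input_file_path: Path, output_directory: Path) -> Path:
    """Build the output path, e.g. report.pdf -> <output_directory>/report_page1.pdf.

    The "_page1" suffix is an illustrative choice, not a PyPDF2 convention.
    """
    return output_directory / f"{input_file_path.stem}_page1.pdf"


def export_first_page(input_file_path: Path, output_directory: Path) -> Path:
    """Copy the first page of one PDF into a new single-page PDF."""
    # Imported here so the path helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_file_path)
    writer = PdfWriter()
    # Add the page object itself, not the text extracted from it.
    writer.add_page(reader.pages[0])

    # Create the output directory if it does not exist yet.
    output_directory.mkdir(parents=True, exist_ok=True)
    output_file_path = first_page_path(input_file_path, output_directory)
    with open(output_file_path, "wb") as output_stream:
        writer.write(output_stream)
    return output_file_path


if __name__ == "__main__":
    input_directory = Path("Fund_Docs_Sample")
    output_directory = Path("First Pages")
    for input_file_path in sorted(input_directory.glob("*.pdf")):
        print(export_first_page(input_file_path, output_directory))
```

Note that reader.pages[0] raises an IndexError for a PDF with no pages, so you may want to skip such files in a large library.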