python pdf text-mining pypdf

Extract First Page of All PDF Documents in a Library


I am new to PDF handling in Python. I have a document library that contains a large number of PDF documents, and I am trying to extract the first page of each one. I have produced the code below.

My initial for loop "for entry in entries" returns the name of all documents in the library. I verify this by successfully printing all document names in the library.

I am using pdfReader.getPage to select the page of each document and the extractText function to extract the text from that page. However, when I run the entire script, I get an error stating that one of the documents cannot be located. The document does exist in the library: I have confirmed in the library folder that the file is there, and its name is also printed in the list of documents.

I believe the issue is with how extractText iterates through the documents, but I am unclear on how to resolve it. Would anyone have any suggestions?

import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

# get the file names in the directory
directory = 'Fund Docs'
entries = os.listdir(directory)


for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)

    # creating a page object
    pageObj = pdfReader.getPage(0)

    # extracting text from page
    print(pageObj.extractText())

    # closing the pdf file object
    pdfFileObj.close()




Solution

  • You need to specify the full path:

    pdfFileObj = open(directory + '/' + entry, 'rb')
    

    This opens the file at Fund Docs/FILE_NAME.pdf. os.listdir returns bare file names without the folder, so when you pass only entry, open looks for the file in the current working directory and does not find it. Prefixing the directory tells it to look for the entry inside that folder.
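
The same fix can also be written with os.path.join, which builds the path with the right separator for the platform. The following is only a minimal sketch: the helper name pdf_paths is hypothetical (it is not part of the original script), and skipping anything that is not a .pdf file is an added assumption about what the library folder may contain.

```python
import os


def pdf_paths(directory):
    """Return the full path of every .pdf file directly inside `directory`.

    Hypothetical helper for illustration; not part of the original script.
    """
    paths = []
    for entry in sorted(os.listdir(directory)):
        # os.listdir gives bare names, so join each one with the folder
        full_path = os.path.join(directory, entry)
        # assumption: sub-folders and non-PDF files should be skipped
        if os.path.isfile(full_path) and entry.lower().endswith('.pdf'):
            paths.append(full_path)
    return paths
```

Each returned path can then be passed to open(path, 'rb') in place of entry in the original loop, for example by iterating over pdf_paths('Fund Docs').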