
Extracting text from a PDF file using PyPDF2


This is my first question here, so I apologise if it ends up in the wrong place or if I miss any valuable info. I am also incredibly new to coding and Python in general.

I'm using Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52)

I'm trying to write some code that will extract all the text from a PDF file and store it in a variable (I know, simple stuff!).

I have managed to get this to work without issue on a one-page PDF, but when I try it on a 96-page PDF, only the first and last pages end up in the variable. This is the code I'm using:

import PyPDF2

pdfFile2 = open('/filepath/ir-2030.pdf', 'rb')
irReader = PyPDF2.PdfFileReader(pdfFile2)

pageNum2 = str(irReader.numPages)
print('Your document has ' + pageNum2 + ' pages' + '\n')

for pN in range(irReader.numPages):
    ir2030 = irReader.getPage(pN).extractText()

print(ir2030)

I have used almost identical code previously and it worked without issue, but for a reason unknown to me, print(ir2030) only returns pages 1 and 96 of the PDF document.

Any help would be greatly appreciated or if there is a better way of doing what I'm trying to do...

Cheers


Solution

  • Each iteration of the loop overwrites ir2030, so the text from earlier pages is discarded. Append each page's text to a list instead:

    ir2030s = []
    for pN in range(irReader.numPages):
        ir2030s.append(irReader.getPage(pN).extractText())

    print(ir2030s)

    Or use a list comprehension:

    ir2030s = [irReader.getPage(pN).extractText() for pN in range(irReader.numPages)]
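If you want the whole document in one string rather than a list, you can join the per-page text afterwards. Below is a minimal sketch of that idea. The FakePage and FakeReader classes and the extract_all_text helper are hypothetical names made up for this example so it runs without a PDF file; they only mimic the numPages, getPage and extractText calls used in the question.

```python
class FakePage:
    """Hypothetical stand-in for a PyPDF2 page object (not part of PyPDF2)."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Hypothetical stand-in for the reader object used in the question."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, index):
        return self._pages[index]


def extract_all_text(reader):
    """Collect the text of every page, then join it into a single string."""
    page_texts = [reader.getPage(pN).extractText() for pN in range(reader.numPages)]
    return "\n".join(page_texts)


# Demo with made-up page contents; with a real file you would pass irReader.
demo_reader = FakeReader(["first page", "middle page", "last page"])
full_text = extract_all_text(demo_reader)
print(full_text)
```

With the code from the question, the equivalent call would be ir2030 = extract_all_text(irReader). Note that newer PyPDF2/pypdf releases rename these calls (PdfReader, reader.pages, page.extract_text()), so the helper may need adjusting depending on the installed version.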