This is my first question here, so I apologise if it ends up in the wrong place or I leave out any useful info. I am also incredibly new to coding and Python in general.
I'm using Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52)
I'm trying to write some code that will extract all the text from a PDF file and store it in a variable (I know, simple stuff!).
I have managed to get this to work without issue on a 1-page PDF, but when I try it on a 96-page PDF only the first and last pages end up in the variable. This is the code I'm using:
import PyPDF2

pdfFile2 = open('/filepath/ir-2030.pdf', 'rb')
irReader = PyPDF2.PdfFileReader(pdfFile2)
pageNum2 = str(irReader.numPages)
print('Your document has ' + pageNum2 + ' pages' + '\n')
for pN in range(irReader.numPages):
    ir2030 = irReader.getPage(pN).extractText()
print(ir2030)
I have used almost identical code before and it worked without issue, but for some reason print(ir2030) now only gives me pages 1 and 96 of the PDF document.
Any help would be greatly appreciated, as would a pointer to a better way of doing what I'm trying to do...
Cheers
Every iteration, you reset the value of ir2030, so it only ever holds the text of the most recent page. Maybe append the values to a list?
ir2030s = []
for pN in range(irReader.numPages):
    ir2030s.append(irReader.getPage(pN).extractText())
print(ir2030s)
Or use a list comprehension:
ir2030s = [irReader.getPage(pN).extractText() for pN in range(irReader.numPages)]
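Since the question asks for all the text in a single value, here is a minimal sketch that joins the per-page text into one string. It assumes the same legacy PyPDF2 interface used above (numPages, getPage, extractText). The helper name extract_all_text and the FakeReader/FakePage stand-ins are made up for illustration, so the snippet runs without a real PDF.

```python
def extract_all_text(reader):
    """Join the text of every page of `reader` into one string.

    `reader` is assumed to expose the legacy PyPDF2 interface used in
    the question: `numPages` and `getPage(n).extractText()`.
    """
    page_texts = [reader.getPage(n).extractText() for n in range(reader.numPages)]
    return "\n".join(page_texts)


# Hypothetical stand-ins (not part of PyPDF2) so the sketch runs without a PDF.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, n):
        return self._pages[n]


demo_text = extract_all_text(FakeReader(["page one", "page two", "page three"]))
print(demo_text)
```

With the real file you would pass in the reader from the question instead, e.g. ir2030 = extract_all_text(irReader). Joining with a newline is just one choice; any separator works.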