Hoping for some help concatenating text strings in a for loop. I have written the code below. However, my for page_num in range(no_pages) loop only adds the final page of my PDF to the variable all_text. What am I doing wrong?
If I do the following, the text is concatenated correctly. The PDF file is two pages long (no_pages = 2):
page1 = pdfReader.getPage(0).extractText()
page2 = pdfReader.getPage(1).extractText()
all_text = page1 + page2
This is my full code on a test file, 'H:\PyTest\Test file 3.pdf'
import os
import datetime
import PyPDF2
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

search_dir = 'H:\PyTest\Test file 3.pdf'
pdfFileObj = open(search_dir, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
no_pages = pdfReader.numPages
no_pages

for page_num in range(no_pages):
    all_text = ""
    new_text = pdfReader.getPage(page_num).extractText()
    all_text += new_text

print(sent_tokenize(all_text))

word_search = ['Random', 'Dynamic', 'Company', 'Stake', 'results']
for item in word_search:
    if item in all_text:
        print(item + ': Found')
    else:
        print(item + ': Not Found')

pdfFileObj.close()
Ideally I do not want to create new files to copy text to/save, as this function is to sit as part of a wider function that:
To confirm, this is the piece of code that isn't working as expected:
for page_num in range(no_pages):
    all_text = ""
    new_text = pdfReader.getPage(page_num).extractText()
    all_text += new_text
In your for loop, all_text is reset to an empty string '' on every iteration, so only the text of the last page survives. You need to place all_text = '' before the loop:
all_text = ""
for page_num in range(no_pages):
    new_text = pdfReader.getPage(page_num).extractText()
    all_text += new_text
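The effect of where the accumulator is initialised can be reproduced without a PDF. This is only a minimal sketch: the list pages is a hypothetical stand-in for the strings that extractText() would return, and buggy_text and fixed_text are illustrative names.

```python
# Hypothetical stand-in for the text extracted from each PDF page.
pages = ["First page. ", "Second page."]

# Accumulator reset inside the loop: every iteration discards earlier pages.
for page_text in pages:
    buggy_text = ""
    buggy_text += page_text

# Accumulator initialised once, before the loop: all pages are kept.
fixed_text = ""
for page_text in pages:
    fixed_text += page_text

print(repr(buggy_text))
print(repr(fixed_text))
```

With the reset inside the loop, only the last element remains; with the initialisation moved above the loop, both strings end up in the result.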
A more Pythonic way to concatenate strings is the join method with a list comprehension over the page numbers:
all_text = ''.join([pdfReader.getPage(page_num).extractText() for page_num in range(no_pages)])
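A self-contained sketch of joining the text of every page, followed by the word search from the question. FakePage and FakeReader are hypothetical stand-ins that mimic only the getPage, extractText and numPages names used above, so the example runs without PyPDF2 or a real file. The page contents are made up for illustration.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page object."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Hypothetical stand-in for the reader, exposing numPages and getPage."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, page_num):
        return self._pages[page_num]


# Made-up page contents, purely for illustration.
pdfReader = FakeReader(["Random results. ", "Company stake."])

# Build the full text by iterating over page numbers, not over characters.
all_text = "".join(
    pdfReader.getPage(page_num).extractText()
    for page_num in range(pdfReader.numPages)
)

# Case-sensitive substring search, as in the question.
word_search = ["Random", "Dynamic", "Company", "Stake", "results"]
found = {word: word in all_text for word in word_search}

for word, is_found in found.items():
    print(word + (": Found" if is_found else ": Not Found"))
```

Note that the in test is case-sensitive, so 'Stake' is not matched by 'stake' in this made-up text. Lowercasing both sides before comparing would be one way to relax that, if it suits the wider function.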