
for loop concatenating strings as part of bulk PDF read and search


I'd like some help concatenating text strings in a for loop. I have written the code below, but my for page_num in range(no_pages) loop only adds the final page of my PDF to the variable all_text. What am I doing wrong?

If I do the following, the text is concatenated correctly. The PDF file is two pages long (no_pages = 2):

page1 = pdfReader.getPage(0).extractText()
page2 = pdfReader.getPage(1).extractText()
all_text = page1 + page2

This is my full code on a test file, 'H:\PyTest\Test file 3.pdf'

import os
import datetime
import PyPDF2
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Raw string so the backslashes in the Windows path are not treated as escapes
search_dir = r'H:\PyTest\Test file 3.pdf'

pdfFileObj = open(search_dir, 'rb') 

pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 

no_pages = pdfReader.numPages
no_pages


for page_num in range(no_pages):
    all_text = ""
    new_text = pdfReader.getPage(page_num).extractText()
    all_text += new_text 

print(sent_tokenize(all_text))

word_search = ['Random', 'Dynamic', 'Company', 'Stake', 'results']

for item in word_search: 
    if item in all_text:
        print(item + ': Found')
    else:
        print(item + ': Not Found')

pdfFileObj.close() 

Ideally I do not want to create new files to copy or save the text to, as this code is to sit inside a wider function that:

  1. walks through a large directory of files,
  2. searches each PDF document in the directory tree for the list of search words,
  3. prints the name of the file each word was found in and the creation date of that file,
  4. prints the sentence the word appears in, if possible (ideally I would like the paragraph, but I need to explore nltk further to see if that is possible).
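
The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: search_tree and find_sentences are hypothetical names, extract_text is a placeholder for whatever function returns the full text of one PDF (for example the PyPDF2 page loop), and the regex sentence splitter is a naive stand-in for nltk's sent_tokenize.

```python
import datetime
import os
import re


def find_sentences(text, words):
    """Map each search word found in text to the sentences containing it."""
    # Naive splitter (an assumption, not nltk): break after '.', '!' or '?'
    # when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = {}
    for word in words:
        matching = [s for s in sentences if word in s]
        if matching:
            hits[word] = matching
    return hits


def search_tree(root_dir, words, extract_text):
    """Walk root_dir and yield (path, created, hits) for each matching PDF.

    extract_text is a caller-supplied function (hypothetical here) that
    takes a file path and returns the document's text as one string.
    """
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(folder, name)
            hits = find_sentences(extract_text(path), words)
            if hits:
                # getctime is the creation time on Windows; on Unix it is
                # the time of the last metadata change.
                created = datetime.datetime.fromtimestamp(os.path.getctime(path))
                yield path, created, hits
```

Nothing is written to disk: each document's text stays in memory, and the caller can loop over search_tree(...) and print the path, the date, and the matching sentences for every hit.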

To confirm, this is the piece of code that isn't working as expected:

for page_num in range(no_pages):
    all_text = ""
    new_text = pdfReader.getPage(page_num).extractText()
    all_text += new_text 

Solution

  • Your for loop resets all_text to an empty string on every iteration, so only the text of the last page survives.

    You need to place all_text = "" before the loop:

    all_text = ""
    
    for page_num in range(no_pages):
        new_text = pdfReader.getPage(page_num).extractText()
        all_text += new_text
    

    A more Pythonic way to concatenate the strings is the join method with a list comprehension over the page numbers:

    all_text = ''.join([pdfReader.getPage(page_num).extractText() for page_num in range(no_pages)])
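
To see the difference without needing a PDF, here is a minimal sketch that uses a plain list of strings as a stand-in for the page texts. The list page_texts and the name buggy_text are assumptions for illustration; in the real script each string would come from pdfReader.getPage(page_num).extractText().

```python
# Stand-in for the text of each page of a two-page PDF (illustrative data).
page_texts = ["Text of page one. ", "Text of page two."]

# Buggy version: the accumulator is reset on every pass through the loop,
# so only the last page is left at the end.
for text in page_texts:
    buggy_text = ""
    buggy_text += text

# Fixed version: initialise the accumulator once, before the loop.
all_text = ""
for text in page_texts:
    all_text += text

# Equivalent version using join.
joined_text = "".join(page_texts)

print(repr(buggy_text))
print(repr(all_text))
print(all_text == joined_text)
```

Note that PdfFileReader, getPage and extractText are the older PyPDF2 names; as far as I know, PyPDF2 3.0 replaced them with PdfReader, reader.pages and extract_text(), so the calls may need adjusting on a newer install.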