python-3.x, pypdf

Find Multiple Words from Multiple PDF Files with Python


I'm trying to write a Python script that loads multiple PDF files and searches them for specific words.

I have a script that takes one word and tries to find it in one PDF; I supply both the word and the file name myself. I was hoping to extend this script to multiple words and multiple PDFs. I'm aware that the final script would need additional functions from the os module, but my knowledge of Python is a little sketchy at times.

I thought this would be a basic task, but Google keeps failing me, and my question seems too specific to turn up an existing answer, which is why I'm here.

What I have so far:

import PyPDF2 as PDF #import pdf module 
import re

p = PDF.PdfFileReader("UserJoe.pdf")

# get number of pages
NumPages = p.getNumPages()

#define keyterms; David, Final, End, Score, Birthday, Hello Ben

kTerm = "David, Final, End, Score, Birthday, Hello Ben"

#extract text and do the search
for i in range(0, NumPages):
    PageObj = p.getPage(i)
    print("Looking through page " + str(i))
    Text = PageObj.extractText()
    Result = re.search(kTerm,Text)

    if Result:
         print(f"{kTerm} found")
    else:
         print("0")

So this script works, but not really how I want it to. It will only search for "David", not the whole list of terms, which is what I want. And to reiterate, I want this to work for multiple PDF files, not just one whose file name I have to provide.

Any help is greatly appreciated.


Solution

  • Your search term is wrong. re.search(kTerm, Text) interprets kTerm as a regular expression. You define kTerm as "David, Final, End, Score, Birthday, Hello Ben", which only matches an exact occurrence of the whole string David, Final, End, Score, Birthday, Hello Ben, commas and spaces included. You can replace each ", " with the pipe symbol ("|"), which acts as an "or" in a regular expression. If you do

    kTerm = "David, Final, End, Score, Birthday, Hello Ben".replace(", ", "|")
    

    kTerm becomes "David|Final|End|Score|Birthday|Hello Ben", so the search matches either "David" or "Final" or "End" or...
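
    To cover the second half of the question (several PDFs, not just one), the same alternation idea can be wrapped in a loop over a folder. The following is only a sketch under a few assumptions: it uses the pypdf package (the successor to PyPDF2) with PdfReader, reader.pages and extract_text() instead of the older PdfFileReader calls from the question, and the names build_pattern, find_terms and search_pdfs are hypothetical helpers made up for this example. re.escape is added so that a term containing characters such as "." or "(" is matched literally.

```python
import re
from pathlib import Path

# The key terms from the question, kept as a list rather than one string.
TERMS = ["David", "Final", "End", "Score", "Birthday", "Hello Ben"]


def build_pattern(terms):
    """Compile a single 'term1|term2|...' pattern (hypothetical helper).

    re.escape keeps regex metacharacters inside a term literal.
    """
    return re.compile("|".join(re.escape(term) for term in terms))


def find_terms(text, pattern):
    """Return the set of distinct terms that occur somewhere in text."""
    return {match.group(0) for match in pattern.finditer(text)}


def search_pdfs(folder, terms):
    """Yield (file name, page number, found terms) for each page with a hit.

    Assumes the pypdf package is installed; it is imported here so the
    helpers above can be used without it.
    """
    from pypdf import PdfReader

    pattern = build_pattern(terms)
    for path in sorted(Path(folder).glob("*.pdf")):
        reader = PdfReader(str(path))
        for page_number, page in enumerate(reader.pages, start=1):
            # extract_text() may yield nothing for image-only pages.
            found = find_terms(page.extract_text() or "", pattern)
            if found:
                yield path.name, page_number, found


if __name__ == "__main__":
    # "." means the current directory; point this at your own PDF folder.
    for name, page_number, found in search_pdfs(".", TERMS):
        print(f"{name}, page {page_number}: {sorted(found)}")
```

    Note that the match is case-sensitive and not limited to whole words, so "End" would also be found inside "Ending". If that matters, passing re.IGNORECASE to re.compile or wrapping each term in \b word boundaries are possible adjustments.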