I'm trying to write a Python script which will load multiple PDF files and then search them for specific words.
I have a script which takes 1 word and tries to find it in 1 PDF; both the word and the file name are provided by me. I was hoping to extend this script to multiple words and multiple PDFs. I'm aware that the final script would need additional methods from the os module, but my knowledge of Python is a little sketchy at times.
Despite what I thought was going to be a basic task, Google keeps failing me, and it seems like I'm asking too specific a question, which is why I'm here.
What I have so far:
import PyPDF2 as PDF #import pdf module
import re
p = PDF.PdfFileReader("UserJoe.pdf")
# get number of pages
NumPages = p.getNumPages()
#define keyterms; David, Final, End, Score, Birthday, Hello Ben
kTerm = "David, Final, End, Score, Birthday, Hello Ben"
#extract text and do the search
for i in range(0, NumPages):
    PageObj = p.getPage(i)
    print("Looking through page " + str(i))
    Text = PageObj.extractText()
    Result = re.search(kTerm,Text)
    if Result:
        print(f"{kTerm} found")
    else:
        print("0")
So this script works, but not really how I want it to. It will only search for "David" but not the whole string of terms, which is what I want. And to reiterate, I want this to work for multiple PDF files, not just 1 for which I have to provide the file name.
Any help greatly appreciated.
Your search term is wrong. re.search(kTerm,Text) will interpret kTerm as a regular expression. You define kTerm as "David, Final, End, Score, Birthday, Hello Ben", which is looking for an exact occurrence of the whole string "David, Final, End, Score, Birthday, Hello Ben". You can replace the ", " with the pipe symbol ("|"), which is like an "or". If you do
kTerm = "David, Final, End, Score, Birthday, Hello Ben".replace(", ", "|")
which is "David|Final|End|Score|Birthday|Hello Ben", you search for either "David" or "Final" or "End" or...
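To also cover the "multiple PDFs" part of the question, here is a minimal sketch rather than a definitive implementation. It builds the same "|" alternation from a list of terms, reports which terms actually occur, and loops over every .pdf file in a folder. The helper names (build_pattern, terms_found, pdf_page_texts, search_folder) and the folder name in the usage line are made up for illustration. The PDF reading reuses the same PyPDF2 calls as the question; newer PyPDF2 releases renamed them, so that part may need adjusting for your installed version.

```python
import glob
import os
import re

TERMS = ["David", "Final", "End", "Score", "Birthday", "Hello Ben"]


def build_pattern(terms):
    # Join the terms with "|" so the regex matches any one of them.
    # re.escape keeps terms containing characters like "+" or "." literal.
    return re.compile("|".join(re.escape(t) for t in terms))


def terms_found(text, pattern):
    # findall returns every match; a set keeps each found term once.
    return set(pattern.findall(text))


def pdf_page_texts(path):
    # Assumption: the legacy PyPDF2 API used in the question is available.
    import PyPDF2
    reader = PyPDF2.PdfFileReader(path)
    for i in range(reader.getNumPages()):
        yield reader.getPage(i).extractText()


def search_folder(folder, terms, page_texts=pdf_page_texts):
    # Map each PDF file name in the folder to the set of terms found in it.
    pattern = build_pattern(terms)
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        found = set()
        for text in page_texts(path):
            found |= terms_found(text, pattern)
        results[os.path.basename(path)] = found
    return results
```

Calling search_folder("pdfs", TERMS), where "pdfs" is a hypothetical folder name, would return a dict keyed by file name. Note that a plain alternation also matches inside longer words (for example "End" inside "Ending"); wrapping each term in \b word boundaries is one way to avoid that if it matters.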