I want to extract all instances of a particular word from a PDF, e.g. 'math'. So far I am converting the PDF to text using PyPDF2 and then running a regex on it to find what I want. This is the example PDF.
When I run my code, instead of returning the matches for my regular expression pattern 'math', it returns a string of the whole page. Please help, thanks.
#First Change Current Working Directory to desktop
import os
os.chdir('/Users/Hussein/Desktop') #File is located on Desktop
#Second is the PyPDF2
import PyPDF2
pdfFileObj=open('TEST1.pdf','rb') #Opening the File
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
pageObj=pdfReader.getPage(3) #For the test I only need page 3
TextVersion=pageObj.extractText()
print(TextVersion)
#Third is the Regular Expression
import re
match=re.findall(r'math',TextVersion)
for match in TextVersion:
    print(match)
Instead of just getting all the instances of 'math' I receive this:
I
n
t
r
o
d
u
c
t
i
o
n
etc etc
The TextVersion variable holds text. When you use it in a for loop, it gives you the text one character at a time, as you have seen. The findall function returns a list of matches, so if you loop over that list instead, you will get each matched word (which in your test will all be the same).
import re
for match in re.findall(r'math',TextVersion):
    print(match)
The returned result from findall would be something like:
["math", "math", "math"]
So your output will then be:
math
math
math
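To see the difference without needing the PDF, here is a minimal sketch that uses a made-up string in place of the extracted page text. The sample_text value and the variable names are assumptions for illustration only, not taken from your file. It also shows that the pattern is case-sensitive by default, so a capitalised 'Math' is only found if you pass re.IGNORECASE.

```python
import re

# Hypothetical stand-in for the text returned by extractText()
sample_text = "Introduction to math. Math is fun, and math is everywhere."

# Iterating over the string itself yields single characters
chars = [ch for ch in sample_text][:5]
print(chars)

# Iterating over the findall result yields one item per match
matches = re.findall(r'math', sample_text)
for m in matches:
    print(m)

# The pattern is case-sensitive by default; re.IGNORECASE also catches 'Math'
matches_any_case = re.findall(r'math', sample_text, flags=re.IGNORECASE)
print(len(matches), len(matches_any_case))
```

Note that in the original code the name match is first bound to the findall result and then immediately reused as the loop variable over TextVersion, so the list of matches is never actually used.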