Tags: python, regex, python-3.x, pdf, pypdf

Not receiving correct pattern from regex on PyPDF2 for a PDF


I want to extract all instances of a particular word, e.g. 'math', from a PDF. So far I am converting the PDF to text using PyPDF2 and then running a regex on it to find what I want. This is the example PDF.

When I run my code, instead of returning the matches for my regular expression pattern 'math', it returns the text of the whole page. Please help, thanks.

#First Change Current Working Directory to desktop

import os
os.chdir('/Users/Hussein/Desktop')         #File is located on Desktop


#Second is the PyPDF2

import PyPDF2
pdfFileObj=open('TEST1.pdf','rb')          #Opening the File
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
pageObj=pdfReader.getPage(3)               #For the test I only need page 3
TextVersion=pageObj.extractText()
print(TextVersion)



#Third is the Regular Expression

import re
match=re.findall(r'math',TextVersion)
for match in TextVersion:
      print(match)

Instead of just getting all the instances of 'math' I receive this:

I
n
t
r
o
d
u
c
t
i
o
n

etc etc


Solution

  • The TextVersion variable holds a string. Iterating over a string in a for loop yields one character at a time, which is the output you are seeing. Your loop also reuses the name match as its loop variable, so the list returned by re.findall is never used. findall returns a list of matches, so loop over that list instead and you will get each matched word (in your test they will all be the same).

    import re
    
    for match in re.findall(r'math', TextVersion):
        print(match)
    

    The returned result from findall would be something like:

    ["math", "math", "math"]
    

    So your output will then be:

    math
    math
    math
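
To see the difference without a PDF at hand, here is a minimal sketch that uses a made-up string in place of the text returned by extractText(). The names sample_text, chars, matches and whole_words are assumptions for illustration only and are not part of the original code.

```python
import re

# Hypothetical stand-in for the page text that extractText() would return.
sample_text = "Introduction to math. Applied math and pure math."

# Iterating over a string yields one character at a time,
# which is what the loop in the question did.
chars = [ch for ch in sample_text]

# Iterating over the list returned by re.findall yields one match at a time.
matches = re.findall(r'math', sample_text)
for match in matches:
    print(match)

# Optional variant: \b limits the pattern to the whole word,
# so the "math" inside "mathematics" is not matched.
whole_words = re.findall(r'\bmath\b', "math and mathematics", flags=re.IGNORECASE)
```

The plain pattern r'math' also matches inside longer words such as "mathematics". If only the standalone word is wanted, the \b word-boundary variant may be closer to the goal, although text extracted from a PDF can contain unexpected spacing that affects where boundaries fall.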