I tried to use PyPDF2 with Python 3 to search for a keyword in a given file. The function is searchFromFile(path: str, keyword: str) -> List[PageObject], as follows:
    def searchFromFile(path: str, keyword: str) -> List[PageObject]:
        pdf = pypdf.PdfFileReader(open(path, "rb"))
        if pdf.isEncrypted:
            pdf.decrypt('')
        numberOfPages = pdf.getNumPages()
        result = [PageObject]
        for pageNumber in range(0, numberOfPages):
            page = pdf.getPage(pageNumber)
            text = page.extractText()
            if keyword in text:
                result.append(page)
        return result

    if __name__ == '__main__':
        resultList = searchFromFile(sys.argv[1], sys.argv[2])
        for page in resultList:
            print("page content:", page.extractText())
The return type is a list of PageObject so that I can use the methods of PageObject in the main block, as in the code above. But I got the following error:
    Traceback (most recent call last):
      File "C:\Users\...\git\python\venv\pdftool\SearchFromPdf.py", line 28, in <module>
        print("page content:",page.extractText())
    TypeError: extractText() missing 1 required positional argument: 'self'
Question: how do I resolve this error?
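Can you confirm that this code works? The change is result = [] in place of result = [PageObject]: the original list started with the PageObject class itself as its first element, so the loop in the main block called extractText() on the class rather than on a page instance, which is what raises the missing 'self' TypeError.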
    # sudo apt-get install python3-pypdf2
    import sys

    import PyPDF2 as pypdf

    def searchFromFile(path: str, keyword: str):
        pdf = pypdf.PdfFileReader(open(path, "rb"))
        if pdf.isEncrypted:
            # Try the empty password, which many "protected" PDFs accept.
            pdf.decrypt('')
        numberOfPages = pdf.getNumPages()
        # ~ result = [PageObject]
        # Start with an empty list; [PageObject] would put the class itself in it.
        result = []
        for pageNumber in range(0, numberOfPages):
            print("page", pageNumber, "/", numberOfPages)
            page = pdf.getPage(pageNumber)
            text = page.extractText()
            if keyword in text:
                result.append(page)
        return result

    if __name__ == '__main__':
        resultList = searchFromFile(sys.argv[1], sys.argv[2])
        for page in resultList:
            print("page content:", page.extractText())
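The difference between the two initialisations can be reproduced without any PDF at all. The following is only a minimal sketch: Page is a hypothetical stand-in for PyPDF2's PageObject, and search is an illustrative helper, not part of the library. It shows that a list written as [Page] holds the class itself, and that the type belongs in an annotation while the value should be an empty list.

```python
from typing import List


class Page:
    """Hypothetical stand-in for PyPDF2's PageObject (assumption, not the real class)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def extractText(self) -> str:
        return self.text


def search(pages: List[Page], keyword: str) -> List[Page]:
    # The type goes in the annotation; the value is an empty list.
    result: List[Page] = []
    for page in pages:
        if keyword in page.extractText():
            result.append(page)
    return result


# Buggy initialisation: a list whose only element is the class itself.
buggy = [Page]
try:
    # Calling the method on the class, not an instance, so 'self' is missing.
    buggy[0].extractText()
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

pages = [Page("alpha beta"), Page("gamma"), Page("beta delta")]
matches = search(pages, "beta")

print(error_message)
print([p.extractText() for p in matches])
```

If the return annotation from the question is wanted as well, it should be possible to keep -> List[PageObject] on searchFromFile, provided List and PageObject are imported; only the initial value of result needs to change.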