python, python-3.x, pypdf

How to properly use the returned PageObject to extractText() with PyPDF2


I am using PyPDF2 with Python 3 to search a given PDF file for a keyword. The function is searchFromFile(path: str, keyword: str) -> List[PageObject], shown below:

def searchFromFile(path:str,keyword:str) -> List[PageObject]:
  pdf = pypdf.PdfFileReader(open(path, "rb"))
  if pdf.isEncrypted:
    pdf.decrypt('')
  numberOfPages = pdf.getNumPages()
  result = [PageObject]
  for pageNumber in range(0,numberOfPages):
    page = pdf.getPage(pageNumber)
    text = page.extractText()
    if keyword in text:
        result.append(page)
  return result

if __name__ == '__main__':
  resultList = searchFromFile(sys.argv[1], sys.argv[2])
  for page in resultList:
    print("page content:",page.extractText())

The return type is a list of PageObject so that I can call PageObject methods in the main block, as in the code above. However, I get the following error:

Traceback (most recent call last):
  File "C:\Users\...\git\python\venv\pdftool\SearchFromPdf.py", line 28, in <module>
    print("page content:",page.extractText())
TypeError: extractText() missing 1 required positional argument: 'self'

Question: how do I resolve this error?


Solution

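  • Can you confirm that this code works? In the question, result = [PageObject] does not create an empty list: it creates a list whose first element is the PageObject class itself. The loop in the main block therefore calls extractText() on the class rather than on a page instance, so nothing is bound to self and Python raises the TypeError. Starting from result = [] avoids this: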

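    # sudo apt-get install python3-pypdf2
    # Uses the legacy PyPDF2 names (PdfFileReader, getPage, extractText);
    # PyPDF2 3.x renamed them to PdfReader, reader.pages and extract_text().

    import sys

    import PyPDF2 as pypdf

    def searchFromFile(path:str,keyword:str):
      pdf = pypdf.PdfFileReader(open(path, "rb"))
      if pdf.isEncrypted:
        # Try the empty password.
        pdf.decrypt('')
      numberOfPages = pdf.getNumPages()
      # Start with an empty list; [PageObject] would store the class itself
      # as the first element.
      result = []
      for pageNumber in range(0,numberOfPages):
        # Progress output.
        print ("page",pageNumber,"/",numberOfPages)
        page = pdf.getPage(pageNumber)
        text = page.extractText()
        if keyword in text:
            result.append(page)
      return result

    if __name__ == '__main__':
      # Usage: SearchFromPdf.py <path-to-pdf> <keyword>
      resultList = searchFromFile(sys.argv[1], sys.argv[2])
      for page in resultList:
        print("page content:",page.extractText())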
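
  • To see why the original list initialisation fails, the error can be reproduced without PyPDF2. The sketch below is only an illustration: Page is a hypothetical stand-in for PageObject, and try_extract is a helper invented here to capture the error message instead of crashing.

```python
class Page:
    """Hypothetical stand-in for PyPDF2's PageObject (illustration only)."""

    def __init__(self, text):
        self.text = text

    def extractText(self):
        return self.text


def try_extract(item):
    """Call extractText() and return its result, or the TypeError message."""
    try:
        return item.extractText()
    except TypeError as exc:
        return f"TypeError: {exc}"


# Buggy initialisation: element 0 is the class itself, not an instance.
buggy = [Page]
buggy.append(Page("hello world"))

# Correct initialisation: an empty list that only ever holds instances.
fixed = []
fixed.append(Page("hello world"))

# The first item of `buggy` reports a TypeError about the missing 'self';
# every other item returns its text.
for item in buggy + fixed:
    print(try_extract(item))
```

If the intent of [PageObject] was to declare the element type, an annotation such as result: List[PageObject] = [] (with List imported from typing and PageObject imported from PyPDF2) does that without putting anything in the list.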