How to extract all text from a PDF?


I'm using the PyPDF2 library to extract text from a PDF, but I'm having trouble writing the loop over its pages.

I'm using the following code and I can extract a string from the first page.

from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Print the text of the first page; index [0] is the first page
page = reader.pages[0]
print(page.extractText())

I would like to use the page count returned by .getNumPages() to loop over every index of reader.pages, instead of reading only reader.pages[0].

Here is the code I tried in order to print all 99 pages:

from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Print the text of each page, where [0] is the first page

page = reader.pages[0]
i = 0
print(type(num_page))
print(type(i))
for i in page:
    if i < num_page:
        page = reader.pages[i]
        print(page.extractText())
        i = i + 1
    else:
        print("done")

Error occurred:

Traceback (most recent call last):
  File "/home/wilian/PycharmProjects/ExtractText/pypdf.py", line 13, in <module>
    if i < num_page:
TypeError: '<' not supported between instances of 'NameObject' and 'int'
99
<class 'int'>
<class 'int'>

Process finished with exit code 1

Solution
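
The traceback can be reproduced without a PDF. A PyPDF2 page object behaves like a dictionary, so `for i in page` rebinds `i` to each of the page's keys (NameObject values such as "/Type") rather than to 0, 1, 2, and so on. The sketch below uses a plain dict as a stand-in for the page; the keys and values in `fake_page` are illustrative assumptions, not read from a real file, and the key type shows up as 'str' here instead of 'NameObject'.

```python
# Stand-in for a PyPDF2 page object, which behaves like a dictionary.
# The keys and values are made up for illustration only.
fake_page = {"/Type": "/Page", "/Rotate": 0}
num_page = 99

i = 0                  # this value is overwritten as soon as the loop starts
keys_seen = []
error_message = None

for i in fake_page:    # iterating a dict yields its keys, not page indices
    keys_seen.append(i)
    try:
        i < num_page   # compares a string-like key with an int
    except TypeError as exc:
        error_message = str(exc)

print(keys_seen)
print(error_message)
```

This also explains why `print(type(i))` reported `<class 'int'>` in the question: that line ran before the loop replaced `i` with a dictionary key.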

  • Use a simple for loop over range(), so that i is an integer page index

    Example

    from PyPDF2 import PdfFileReader
    
    
    def pdf_info():
        with open("my_pdf.pdf", "rb") as f:
            reader = PdfFileReader(f)
            for i in range(reader.getNumPages()):
                # i is an int here, so it can index reader.pages directly
                page = reader.pages[i]
                print(page.extractText())
    
    
    if __name__ == '__main__':
        pdf_info()
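
On newer installations the names used above may no longer work: as far as I know, PyPDF2 3.0 removed PdfFileReader, getNumPages() and extractText() in favour of PdfReader, len(reader.pages) and extract_text(), and the project continues under the name pypdf. The following is a minimal sketch under that assumption. The helper names `extract_all_text` and `extract_pdf_text` are hypothetical, chosen for this example, and the code assumes pypdf is installed.

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, one page per block.

    `reader` is any object exposing `.pages`, where each page has an
    `extract_text()` method (for example a pypdf PdfReader).
    """
    # Iterate over the pages themselves, so no index bookkeeping is needed.
    # `or ""` guards against pages that yield no text.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported here so extract_all_text stays usable without pypdf installed.
    from pypdf import PdfReader  # assumption: `pip install pypdf`
    return extract_all_text(PdfReader(path))
```

Calling `print(extract_pdf_text("mypdf.pdf"))` would then print the text of all pages in one go. Iterating over `reader.pages` directly avoids the index variable that caused the original error.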