typeerror python-3.8

Using Python 3.8, I would like to extract text from a random PDF file


I would like to import a PDF file and find the most common words.

import PyPDF2

# Open the PDF file and read the text
pdf_file = open("nita20.pdf", "rb")
pdf_reader = PyPDF2.PdfReader(pdf_file)
text = ""
for page in range(pdf_reader.pages):
    text += pdf_reader.getPage(page).extractText()

I get this error:

TypeError: '_VirtualList' object cannot be interpreted as an integer

How can I resolve this so that I can extract every word from the PDF file? Thanks.


Solution

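  • The error occurs because pdf_reader.pages is a list-like object, not an integer, so it cannot be passed to range() directly; use len(pdf_reader.pages) or iterate over the pages themselves. I also got some deprecation warnings from your code (getPage and extractText have been replaced by pages[...] and extract_text()), but this works (tested on Python 3.11, PyPDF2 version: 3.0.1):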

    import PyPDF2

    # Open the PDF file and read the text from every page.
    # Raw string so that "\t" in the Windows path is not read as a tab.
    with open(r"..\test.pdf", "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        print(len(pdf_reader.pages))  # number of pages
        text = ""
        # pages is list-like, so iterate over it directly
        for page in pdf_reader.pages:
            text += page.extract_text()

    print(text)
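
The question also asks for the most common words, which the code above does not cover. Here is a minimal sketch of one way to do it, assuming the extracted text is already in a string. The function name most_common_words and the simple "letters and apostrophes" definition of a word are my own assumptions, not part of PyPDF2; adjust the pattern if your PDF contains non-English text.

```python
import re
from collections import Counter


def most_common_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs.

    Hypothetical helper: a "word" here is any run of ASCII letters or
    apostrophes, compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "The cat and the hat and the bat"
    print(most_common_words(sample, 2))
```

With the text variable from the extraction loop, the call would be most_common_words(text, 10). Text extracted from PDFs is often noisy (hyphenated line breaks, headers repeated on every page), so the counts may need some cleanup.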