python python-3.x xml-parsing data-extraction

How do you extract the pages from a PDF if you don't know how many pages it has?

I'm writing Python 3 code that takes in an XML file and extracts the text from the documents its links point to (currently trying with PyPDF2). I have written this function that tries to do it:

def DataExtraction(aspects_link):
#aspects_link is a list that has all the links from the XML file
    for i in aspects_link:
        reader = PyPDF2.PdfFileReader(aspects_link[i])
        #extracting the pages
        reader.getPage().extractText()

I get the error "Parameter 'pageNumber' unfilled". Since there are many links to extract from and I don't know how many pages each PDF might have, is there a way to write the code so that it extracts every page without me specifying how many there are?

Solution

You can find out how many pages a PDF has with getNumPages().

Alongside this method, the reader has two properties: numPages and pages. The first is an alias of getNumPages(), so it returns an int (the number of pages), while the second is a list-like sequence holding all the page objects.

for page in range(reader.getNumPages()): ...
for page in range(reader.numPages): ...
for page in reader.pages: ...

Note that in the first two loops, page is an integer, so you need to call reader.getPage(page).extractText(); in the third loop, page is already a PageObject, so you just need to call page.extractText().
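For comparison, here is a minimal sketch of the third form, which never touches a page index. The helper name extract_page_texts is made up for illustration; it only assumes that the object passed in exposes pages and that each page has extractText(), as described above.

```python
def extract_page_texts(reader):
    """Return the extracted text of every page, in page order."""
    # reader.pages yields page objects directly, so no page number is needed.
    return [page.extractText() for page in reader.pages]
```

You would call it as extract_page_texts(PyPDF2.PdfFileReader(path)), where path is whatever file you are reading.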

Here's an example of what your code looks like with the first possibility:

import PyPDF2

def DataExtraction(aspects_link):
    # aspects_link is a list that has all the links from the XML file
    texts = []
    for link in aspects_link:
        # iterate over the items directly; aspects_link[link] would raise a TypeError
        reader = PyPDF2.PdfFileReader(link)
        # extracting the pages
        for page in range(reader.getNumPages()):
            texts.append(reader.getPage(page).extractText())
    return texts
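One caveat: PyPDF2 3.0 and later removed the camelCase names used above. There the reader class is PdfReader, the page count is len(reader.pages), and the text comes from page.extract_text(). The sketch below uses those newer names and keeps the text grouped per link. The function name extract_texts_by_link and the open_reader parameter are hypothetical, introduced only so the reader construction can be swapped out.

```python
def extract_texts_by_link(aspects_link, open_reader):
    """Map each link to a list holding the text of each of its pages.

    open_reader is a hypothetical callable that turns a link into a
    PdfReader-like object (for example PyPDF2.PdfReader for local paths).
    """
    texts = {}
    for link in aspects_link:
        reader = open_reader(link)
        # Pages without a text layer may yield an empty result; store "" for them.
        texts[link] = [page.extract_text() or "" for page in reader.pages]
    return texts
```

With local file paths this could be called as extract_texts_by_link(aspects_link, PyPDF2.PdfReader). If the links are URLs rather than paths, open_reader would first have to download each file, for example by wrapping the downloaded bytes in io.BytesIO before handing them to the reader.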