How do you extract the pages from a PDF if you don't know how many pages it has?


I'm writing Python 3 code that takes in an XML file and extracts the text from the PDFs it links to (currently trying with PyPDF2). I have written this function that tries to do it:

def DataExtraction(aspects_link):
#aspects_link is a list that has all the links from the XML file
    for i in aspects_link:
        reader = PyPDF2.PdfFileReader(aspects_link[i])
        #extracting the pages
        reader.getPage().extractText()

I get the error "Parameter 'pageNumber' unfilled". Since there are many links to extract from and I don't know how many pages each one might have, I was wondering if there's a way to write the code so that it extracts every page without my specifying how many there are.


Solution

  • You can get the number of pages with getNumPages().

    Two properties build on this method: numPages and pages. The first is an alias of getNumPages(), so it returns an int (the number of pages), while the second is a list-like object holding every page object.

    for page in range(reader.getNumPages()): ...
    for page in range(reader.numPages): ...
    for page in reader.pages: ...
    

    Note that in the first two loops, page is an integer index, so you need to call reader.getPage(page).extractText(); in the third loop, page is already a PageObject, so you just need to call page.extractText().

    Here's an example of what your code looks like with the first possibility:

    import PyPDF2

    def DataExtraction(aspects_link):
        # aspects_link is a list that has all the links from the XML file
        texts = []
        for link in aspects_link:
            # iterate over the links themselves; aspects_link[link] would fail
            reader = PyPDF2.PdfFileReader(link)
            # extracting the pages
            for page in range(reader.getNumPages()):
                texts.append(reader.getPage(page).extractText())
        return texts
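
The third loop form (iterating over reader.pages) can be sketched in a way that never asks for the page count at all. This is only a sketch under stated assumptions: the helper names page_text and extract_documents are hypothetical, each source is assumed to be a local file path (a URL would have to be downloaded first), and the getattr fallback is there because PyPDF2 3.0 and its successor pypdf renamed extractText() to extract_text() and PdfFileReader to PdfReader.

```python
def page_text(page):
    """Extract text from one page under either the old or the new method name."""
    # Newer releases expose extract_text(); legacy ones only have extractText().
    extract = getattr(page, "extract_text", None) or getattr(page, "extractText")
    return extract()


def extract_documents(sources, open_reader):
    """Map each source to the list of its page texts, in page order.

    `open_reader` is a callable that builds a reader from one source, for
    example PyPDF2.PdfFileReader (legacy) or PyPDF2.PdfReader (newer releases).
    """
    results = {}
    for source in sources:
        reader = open_reader(source)
        # Iterating over reader.pages needs no page count up front.
        results[source] = [page_text(page) for page in reader.pages]
    return results


# Usage (requires PyPDF2 and real files; the paths below are placeholders):
#   import PyPDF2
#   texts = extract_documents(["a.pdf", "b.pdf"], PyPDF2.PdfFileReader)
#   print(len(texts["a.pdf"]))  # number of pages found in a.pdf
```

Passing the reader class in as an argument keeps the loop itself independent of which PyPDF2 version is installed.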