I'm writing Python 3 code that takes in an XML file and extracts the text from the links it contains (currently trying with PyPDF2). I have written this function that tries to do it:
    def DataExtraction(aspects_link):
        # aspects_link is a list that has all the links from the XML file
        for i in aspects_link:
            reader = PyPDF2.PdfFileReader(aspects_link[i])
            # extracting the pages
            reader.getPage().extractText()
I get the error: Parameter 'pageNumber' unfilled. Since there are many links to extract from and I don't know how many pages each one has, I was wondering if there's a way to write the code so that it extracts every page without me specifying how many there are.
You can find out how many pages there are via getNumPages(). Built on this method, there are two properties: numPages and pages. The first is an alias of getNumPages(), so it returns an int (the number of pages), while the latter is a list-like sequence holding all the page objects. That gives three ways to loop over the pages:
    for page in range(reader.getNumPages()): ...
    for page in range(reader.numPages): ...
    for page in reader.pages: ...
Note that, with the first two loops, page is an integer, so you need to call reader.getPage(page).extractText(); with the last one, page is already a PageObject, so you just need to call page.extractText().
Here's an example of what your code looks like with the first possibility:
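    import PyPDF2

    def DataExtraction(aspects_link):
        # aspects_link is a list that has all the links from the XML file
        texts = []
        for link in aspects_link:
            # loop over the items directly: aspects_link[i] fails because i is an item, not an index
            reader = PyPDF2.PdfFileReader(link)
            # extracting the pages
            for page in range(reader.getNumPages()):
                texts.append(reader.getPage(page).extractText())
        return texts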
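
The third variant described above (iterating over reader.pages) can be sketched as follows. This is only a minimal illustration, assuming the same PyPDF2 method names used in this answer (PdfFileReader, pages, extractText). The function names extract_all_text and extract_from_paths are hypothetical, and the entries are assumed to be local file paths.

```python
def extract_all_text(reader):
    """Return the text of every page of an already-opened reader, one string per page."""
    # reader.pages yields page objects directly, so no page count or index is needed.
    return [page.extractText() for page in reader.pages]


def extract_from_paths(paths):
    """Map each PDF file path to the list of its page texts (hypothetical helper)."""
    # Imported here so that extract_all_text stays usable on its own.
    import PyPDF2

    results = {}
    for path in paths:
        # Assumption: each entry is a path to a local PDF file.
        reader = PyPDF2.PdfFileReader(path)
        results[path] = extract_all_text(reader)
    return results
```

PdfFileReader expects a file path or a file-like object, so if the entries in aspects_link are web URLs, they would probably need to be downloaded first.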