I am trying to extract text from some PDF documents. I have experimented with various tools, investing the most time in pdfminer and PyMuPDF. I started with pdfminer, but moved on to testing PyMuPDF after failing to solve one specific problem: when a PDF document has a number of pages, I want to choose whether or not to process each specific page. However, the problem I am running into with both libraries is that when I try to retrieve the text from one specific page, the text that is returned is nearly all of the text in the document.
Here is a link to a document that has 57 pages.
I will focus here on the case of using PyMuPDF. Here is some code:
import fitz

doc = fitz.open('local_path_to_file_from_link_above')
for page in doc:
    text = page.getText().encode("utf8")
    break
I am breaking here to confirm that I pulled the text from one and only one page, but when I inspect text I discover it contains almost all of the text from the entire document (all 57 pages).
So I was curious whether, despite the appearance of page boundaries in the PDF file, the pages perhaps do not actually exist, so I used the pageCount property to check. The pages seem to be present:
>>> doc.pageCount
57
It is a little hard to describe the output. When I loop through all of the pages, each page does not have exactly all of the content from all of the pages, but it has almost all of it. I determined this by using the following code:
mydict = dict()
for n, page in enumerate(doc):
    print(n, len(page.getText()))
    mydict[n] = page.getText()
Here is the output - for completeness
0 45491
1 45491
2 45491
3 45491
4 45491
5 45491
6 45491
7 45491
8 45491
9 45492
10 45492
11 45492
12 45492
13 45492
14 45492
15 45492
16 45492
17 45492
18 45492
19 45492
20 45492
21 45492
22 45492
23 45492
24 45492
25 45492
26 45492
27 45492
28 45492
29 88408
30 42990
31 42990
32 42990
33 42990
34 42990
35 42990
36 42990
37 42990
38 42990
39 42990
40 42990
41 42990
42 42990
43 42990
44 42990
45 42990
46 42990
47 42990
48 42990
49 42990
50 42990
51 42990
52 42990
53 42990
54 42990
55 42990
56 42990
So there is an aberration in the content of page 29, and there is some variation in the length of the text retrieved from the pages, but poking around at it there seems to be significant overlap. For example:
>>> mydict[0][0:5000] == mydict[1][0:5000]
True
but
>>> mydict[0][-5000:] == mydict[1][-5000:]
False
To sum this up: the library seems to recognize the existing page boundaries, but the text retrieved for an individual page is almost all of the text in the document. Since the library generates a good ToC, I want to use the page numbers provided by that ToC to identify the specific pages I want to further parse and extract data from.
I will note that I ran into similar problems trying to use pdfminer: I could retrieve all of the text, but not just the text from a specific, specified page.
Try the following to get the text from any specific page of that PDF:

import fitz

path = r''
doc = fitz.open(path)
page = doc.loadPage(1)  # page numbers are zero-based, so this loads the second page
page_to_text = page.getText("text")
print(page_to_text)