I am trying to extract text from some PDF documents. I have experimented with various tools, investing the most time in pdfminer and PyMuPDF. I started with pdfminer, but moved on to testing PyMuPDF after failing to solve one specific problem: when a PDF document has a number of pages, I want to choose whether or not to process each specific page. However, the problem I am running into with both libraries is that when I try to retrieve the text from one specific page, the text that is returned is nearly all of the text in the document.
Here is a link to a document that has 57 pages.
I will focus here on the case of using PyMuPDF. Here is some code:
import fitz

doc = fitz.open('local_path_to_file_from_link_above')
for page in doc:
    text = page.getText().encode("utf8")
    break
I am breaking here to confirm that I pulled the text from one and only one page, but when I inspect text I discover it contains almost all of the text from the entire document (all 57 pages).
So I was curious whether, despite the appearance of page boundaries in the PDF file, the pages perhaps do not actually exist, so I used the pageCount property to check. The pages seem to be present:
>>> doc.pageCount
57
It is a little hard to describe the output. When I loop through all of the pages, each page does not have exactly all of the content from all of the pages, but it has almost all of it. I determined this by using the following code:
mydict = dict()
for n, page in enumerate(doc):
    print(n, len(page.getText()))
    mydict[n] = page.getText()
Here is the output - for completeness
0 45491
1 45491
2 45491
3 45491
4 45491
5 45491
6 45491
7 45491
8 45491
9 45492
10 45492
11 45492
12 45492
13 45492
14 45492
15 45492
16 45492
17 45492
18 45492
19 45492
20 45492
21 45492
22 45492
23 45492
24 45492
25 45492
26 45492
27 45492
28 45492
29 88408
30 42990
31 42990
32 42990
33 42990
34 42990
35 42990
36 42990
37 42990
38 42990
39 42990
40 42990
41 42990
42 42990
43 42990
44 42990
45 42990
46 42990
47 42990
48 42990
49 42990
50 42990
51 42990
52 42990
53 42990
54 42990
55 42990
56 42990
So there is an aberration in the content of page 29, and there is some variation in the length of the text retrieved from the pages, but poking around at it there seems to be significant overlap. For example:
>>> mydict[0][0:5000] == mydict[1][0:5000]
True
but
>>> mydict[0][-5000:] == mydict[1][-5000:]
False
To sum this up: the library seems to recognize the existing page boundaries, but the text retrieved for an individual page is almost all of the text in the document. Since the library generates a good ToC, I want to use the page numbers provided by that ToC to identify the specific pages I want to further parse and extract data from.
I will note that I ran into similar problems trying to use pdfminer: I could retrieve all of the text, but not just the text from a specific, specified page.
Try the following to get the text from any specific page of that PDF:

import fitz

path = r''
doc = fitz.open(path)
page = doc.loadPage(1)  # page numbers are zero-based, so this loads the second page
page_to_text = page.getText("text")
print(page_to_text)