I know there are many libraries to extract text from PDF. Specifically, I've been having some difficulty with pymupdf.
From the documentation here: https://pymupdf.readthedocs.io/en/latest/app4.html#sequencetypes
I was hoping to use select() to pick an interval of pages, and then use getText()
This is the document I am using: linear_regression.pdf
import fitz
s = [1, 2]
doc = fitz.open('linear_regression.pdf')
selection = doc.select(s)
text = selection.getText(s)
But I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-c05917f260e7> in <module>()
6 # print(selection)
7 # text = doc.get_page_text(3, "text")
----> 8 text = selection.getText(s)
9 text
AttributeError: 'NoneType' object has no attribute 'getText'
So I'm assuming select() is not being used right
thanks so much
select here, according to the documentation, modifies doc internally and does not return anything. In Python, if a function does not explicitly return anything, it will return None, which is why you see that error.
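To see the same behavior without a PDF at hand, here is a minimal sketch. PageList is a hypothetical stand-in class written only for this illustration, not part of PyMuPDF; its select method changes the object in place and has no return statement, so assigning its result gives None:

```python
class PageList:
    """Hypothetical stand-in for an object whose method edits it in place."""

    def __init__(self, pages):
        self.pages = list(pages)

    def select(self, indices):
        # Mutates self.pages; there is no return statement, so the call
        # evaluates to None.
        self.pages = [self.pages[i] for i in indices]


doc = PageList(["page 0", "page 1", "page 2", "page 3"])
result = doc.select([1, 2])

print(result)     # None
print(doc.pages)  # ['page 1', 'page 2']
```

Calling any method on result at this point raises the same kind of AttributeError as in the question, because result is None while the object that was actually changed is doc.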
However, Document provides a method called get_page_text, which allows you to get the text from a specific page (0-indexed). So for your example, you could write:
import fitz
s = [1, 2] # pages 2 and 3
doc = fitz.open('linear_regression.pdf')
text_by_page = [doc.get_page_text(i) for i in s]
Now, you have a list, where each item in the list is the text from a different desired page. A simple way to convert this to a string is:
text = ' '.join(text_by_page)
which joins the two pages with a space between the last word of the first page and the first word of the second (as if there were no page break at all).
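If you need this in more than one place, the two steps can be wrapped in a small helper. This is only a sketch: extract_pages and its sep parameter are names made up for this answer, and the only thing it assumes about doc is that it has the get_page_text method mentioned above:

```python
def extract_pages(doc, page_numbers, sep="\n"):
    """Return the text of the given 0-indexed pages, joined by sep.

    doc is assumed to be an object with a get_page_text(page_number)
    method, such as the Document returned by fitz.open().
    """
    return sep.join(doc.get_page_text(i) for i in page_numbers)


# Usage with the file from the question:
# import fitz
# doc = fitz.open('linear_regression.pdf')
# text = extract_pages(doc, [1, 2])           # pages 2 and 3, one per line
# text = extract_pages(doc, [1, 2], sep=' ')  # same result as ' '.join above
```

The separator defaults to a newline here so that the page boundary stays visible in the result; pass sep=' ' if you would rather have the pages run together.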