Tags: python, pdf, nlp, pymupdf

How to extract text from a selection of pages in a larger PDF using PyMuPDF?


I know there are many libraries for extracting text from PDFs. Specifically, I've been having some difficulty with PyMuPDF. From the documentation here: https://pymupdf.readthedocs.io/en/latest/app4.html#sequencetypes I was hoping to use select() to pick an interval of pages, and then use getText(). This is the doc I am using: linear_regression.pdf

import fitz
s = [1, 2]
doc = fitz.open('linear_regression.pdf')
selection = doc.select(s)
text = selection.getText(s)

But I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-c05917f260e7> in <module>()
      6 # print(selection)
      7 # text = doc.get_page_text(3, "text")
----> 8 text = selection.getText(s)
      9 text

AttributeError: 'NoneType' object has no attribute 'getText'

So I'm assuming select() is not being used correctly. Thanks so much!


Solution

  • According to the documentation, select() modifies doc in place and does not return anything. In Python, a function that does not explicitly return a value returns None, so selection is None here, which is why you see that error.

    However, Document provides a method called get_page_text, which returns the text of a specific page (0-indexed). So for your example, you could write:

    import fitz
    s = [1, 2] # pages 2 and 3
    doc = fitz.open('linear_regression.pdf')
    text_by_page = [doc.get_page_text(i) for i in s]
    

    Now, you have a list, where each item in the list is the text from a different desired page. A simple way to convert this to a string is:

    text = ' '.join(text_by_page)
    

    which joins the two pages with a space between the last word of the first page and the first word of the second (as if there were no page break at all).
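
If you need this in more than one place, the two steps above can be wrapped in a small helper. This is only a sketch: extract_pages and its separator parameter are hypothetical names, not part of PyMuPDF. It assumes that doc is an open Document, that len(doc) gives the page count, and that get_page_text(pno) returns the text of page pno. It also rejects negative page numbers, which is a deliberate simplification for this example.

```python
def extract_pages(doc, page_numbers, separator="\n"):
    """Return the text of the given 0-indexed pages, joined by `separator`.

    `doc` is assumed to be an open PyMuPDF Document; any object that
    supports len() and get_page_text(pno) will work the same way.
    """
    # Materialize the page numbers so ranges and generators can be read twice.
    page_numbers = list(page_numbers)
    page_count = len(doc)

    # Fail early with a clear message instead of halfway through extraction.
    for pno in page_numbers:
        if not 0 <= pno < page_count:
            raise IndexError(
                f"page {pno} is out of range (document has {page_count} pages)"
            )

    return separator.join(doc.get_page_text(pno) for pno in page_numbers)


# Usage with PyMuPDF (assumes the file from the question exists):
#   import fitz
#   doc = fitz.open('linear_regression.pdf')
#   text = extract_pages(doc, range(1, 3))  # pages 2 and 3
```

Joining with a newline rather than a space keeps the page boundary visible in the result; pass separator=' ' to get the same output as the join shown above.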