python-3.x pdf annotations pymupdf

I am trying to use Fitz (PyMuPDF) to extract data from a PDF that contains text in a very unstructured format, but it returns None at the first step


Here's the code I have been trying, along with its output:

import fitz
import pandas as pd 
doc = fitz.open('xyz.pdf')
page1 = doc[0]
words = page1.get_text("words")

first_annots=[]

rec=page1.first_annot.rect

rec


Output: (image: output of above)

The output I am expecting is for all text rectangles to be identified and made individually accessible. Here's where I found the code that I am implementing: https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/


Solution

  • Independent of your overall intention (to parse unstructured text): accessing the page's annotations via page.first_annot makes no sense at all.

    Your exception is caused by the fact that the page has no annotations, so page.first_annot is None, and reading .rect from None raises an AttributeError.

    Again: whether or not there are annotations has nothing to do with the text of the page. Simply do not access page.first_annot.
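If the goal is one rectangle per piece of text, the word tuples that the question's code already collects should be enough. Below is a minimal sketch, not a definitive implementation: it assumes each tuple follows the layout returned by page.get_text("words"), i.e. (x0, y0, x1, y1, word, block_no, line_no, word_no), and the helper name text_rects_by_line is hypothetical, not part of PyMuPDF.

```python
from collections import defaultdict


def text_rects_by_line(words):
    """Group word tuples into one (bbox, text) pair per text line.

    Assumption: every item in `words` has the layout of
    page.get_text("words"):
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    This helper is illustrative only and is not a PyMuPDF function.
    """
    lines = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        # Words sharing block and line numbers belong to the same text line.
        lines[(block_no, line_no)].append((word_no, x0, y0, x1, y1, word))

    result = []
    for key in sorted(lines):
        items = sorted(lines[key])  # restore reading order via word_no
        # The line's bounding box is the union of its word rectangles.
        bbox = (
            min(item[1] for item in items),
            min(item[2] for item in items),
            max(item[3] for item in items),
            max(item[4] for item in items),
        )
        text = " ".join(item[5] for item in items)
        result.append((bbox, text))
    return result
```

With the variables from the question, this would be called as text_rects_by_line(words), where words = page1.get_text("words"). Each bbox can be passed to fitz.Rect(bbox) if a rectangle object is needed. No annotation access is involved at any point.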