PyPDF2 - Accessing visitor function parameters


Using PyPDF2, I am trying to extract text from a PDF, filtered by font weight and style. However, I have not been able to figure out how to read the "fontDict" object that is passed to the visitor function below.

type(fontDict) returns <class 'PyPDF2.generic._data_structures.DictionaryObject'>, but I cannot figure out which methods return its keys and values.

from PyPDF2 import PdfReader


reader = PdfReader("E:\\Programming\\nlp\\documents_to_search\\GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[0]

def visitor_body(text, cm, tm, fontDict, fontSize):
    print(fontDict['/BaseFont'])


page.extract_text(visitor_text=visitor_body)

Solution

  • The source code for the DictionaryObject class shows that it inherits from both PdfObject and dict, so you can access its keys and values as you would for any dict instance:

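    def visitor_body(text, cm, tm, fontDict, fontSize):
        if fontDict is None:  # fontDict can be None before a font is selected
            return
        keys = fontDict.keys()      # e.g. '/Type', '/Subtype', '/BaseFont'
        values = fontDict.values()
        print(dict(zip(keys, values)))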
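
To get from there to the filtering described in the question, here is a minimal sketch, not a definitive implementation. It assumes that boldness can be guessed from the '/BaseFont' name (for example a name containing "Bold"), which is only a naming convention and will not hold for every PDF. The helper names is_bold, make_bold_collector and extract_bold_text are hypothetical and are not part of PyPDF2.

```python
def is_bold(font_dict):
    """Guess boldness from the /BaseFont name (a heuristic, not guaranteed by the PDF format)."""
    if font_dict is None:  # the visitor may be called before any font is selected
        return False
    # Indexing works because DictionaryObject is a dict subclass
    base_font = str(font_dict["/BaseFont"]) if "/BaseFont" in font_dict else ""
    return "bold" in base_font.lower()


def make_bold_collector():
    """Return (visitor, parts); the visitor appends bold text chunks to parts."""
    parts = []

    def visitor_body(text, cm, tm, font_dict, font_size):
        # Skip whitespace-only chunks and anything not set in a bold-looking font
        if text.strip() and is_bold(font_dict):
            parts.append(text)

    return visitor_body, parts


def extract_bold_text(pdf_path, page_index=0):
    """Hypothetical wrapper: collect the bold-looking text of one page."""
    from PyPDF2 import PdfReader  # imported here so the helpers above run without PyPDF2

    reader = PdfReader(pdf_path)
    visitor, parts = make_bold_collector()
    reader.pages[page_index].extract_text(visitor_text=visitor)
    return "".join(parts)
```

Calling extract_bold_text with the path from the question should return only the chunks whose font name looks bold. The same pattern should work for italics by checking for "italic" or "oblique" in the name instead.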