Using PyPDF2, I am trying to return text from a PDF, filtered by the font weight & style. However, I am so far unable to figure out how to access the "fontDict" object in the visitor function below.
type(fontDict) returns <class 'PyPDF2.generic._data_structures.DictionaryObject'>, but I cannot figure out which method returns its keys and values.
from PyPDF2 import PdfReader
reader = PdfReader("E:\\Programming\\nlp\\documents_to_search\\GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[0]
def visitor_body(text, cm, tm, fontDict, fontSize):
    print(fontDict['/BaseFont'])
page.extract_text(visitor_text=visitor_body)
The source code for the DictionaryObject class indicates that it inherits from PdfObject and dict, so you should be able to access the keys and values as you would for any dict instance:
def visitor_body(text, cm, tm, fontDict, fontSize):
    keys = fontDict.keys()
    values = fontDict.values()
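To get from there to the filtering the question asks about, here is a minimal sketch of a visitor that keeps only text drawn in a bold font. The helper names make_bold_collector and extract_bold_text are hypothetical, and the check for "bold" in the /BaseFont name is only a heuristic: font names vary between PDFs (for example "/Arial,Bold" or "/ABCDEF+Arial-BoldMT"), so you may need to adjust it for your document. Note also that PyPDF2 can pass None as fontDict when it cannot resolve the font, so the visitor guards against that.

```python
def make_bold_collector(collected):
    """Return a visitor that appends text drawn in a bold font to `collected`."""
    def visitor_body(text, cm, tm, fontDict, fontSize):
        # PyPDF2 may pass None when the font is unknown.
        if fontDict is None:
            return
        # DictionaryObject is a dict subclass, so .get() works as usual.
        base_font = str(fontDict.get("/BaseFont", ""))
        # Heuristic (assumption): treat the font as bold if its name says so.
        if "bold" in base_font.lower() and text.strip():
            collected.append(text)
    return visitor_body


def extract_bold_text(pdf_path, page_number=0):
    """Hypothetical helper: return the bold text found on one page."""
    # Imported here so the visitor above can be tried without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    bold_parts = []
    reader.pages[page_number].extract_text(
        visitor_text=make_bold_collector(bold_parts)
    )
    return "".join(bold_parts)
```

Calling extract_bold_text with the path to your PDF should then return the bold text on the first page, provided the fonts in that file carry the weight in their /BaseFont name. If they do not, printing dict(fontDict) inside the visitor is a quick way to see which keys are actually available.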