Where can I find PDFMiner API definitions?


Is there a good API definition for the Python PDFMiner package?

For example, I can see from the source code that LTText contains x0, y0, x1, y1 and some text, and that there is a get_text() method that returns the text. Is the intention to just access x0... directly?

In that case, why wrap the text using _text and get_text()?


Solution

  • The project isn't heavily documented, so for details you will mostly have to read the source. There is, however, some documentation in the form of basic explanations of the main classes and structure.

    For your specific question, LTText functions like an abstract base class. Some objects that inherit from LTText override the get_text method and do something more complicated, like LTTextContainer:

    class LTTextContainer(LTExpandableContainer, LTText):
        def __init__(self):
            LTText.__init__(self)
            LTExpandableContainer.__init__(self)
            return
    
        def get_text(self):
            return ''.join(obj.get_text() for obj in self if isinstance(obj, LTText))
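
The benefit for callers can be seen in a minimal standalone sketch of the same pattern. The classes Text, Char, and Line below are hypothetical stand-ins written for illustration, not PDFMiner's own code; they only assume that a leaf stores its text while a container builds it from its children.

```python
class Text:
    """Stand-in for a base class like LTText (hypothetical, not PDFMiner code)."""

    def get_text(self):
        raise NotImplementedError


class Char(Text):
    """Leaf object: stores its text directly."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class Line(Text):
    """Container: builds its text from its children on demand."""

    def __init__(self, children):
        self._children = list(children)

    def __iter__(self):
        return iter(self._children)

    def get_text(self):
        # Children that are not Text instances contribute nothing.
        return ''.join(obj.get_text() for obj in self if isinstance(obj, Text))


line = Line([Char('P'), Char('D'), object(), Char('F')])
print(line.get_text())  # PDF
```

Because both classes expose get_text(), calling code does not need to know whether it holds a leaf or a container.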
    

    Getter and setter methods usually wrap behavior that subclasses may need to override, or they update state that depends on the input. For example, LTComponent.set_bbox updates six other attributes besides self.bbox.
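
As a rough illustration of the setter case, the sketch below shows one way a set_bbox-style method can keep derived attributes in sync. Component is a hypothetical class written for this example; the attribute names follow the ones mentioned in the question, plus width and height, which are assumed here as the derived values.

```python
class Component:
    """Hypothetical stand-in for a class like LTComponent (not PDFMiner code)."""

    def __init__(self, bbox):
        self.set_bbox(bbox)

    def set_bbox(self, bbox):
        # One call keeps the corner coordinates and the derived
        # width/height consistent with the bounding box.
        (x0, y0, x1, y1) = bbox
        self.x0 = x0
        self.y0 = y0
        self.x1 = x1
        self.y1 = y1
        self.width = x1 - x0
        self.height = y1 - y0
        self.bbox = bbox


box = Component((10, 20, 110, 70))
print(box.x0, box.width, box.height)  # 10 100 50

box.set_bbox((0, 0, 30, 40))
print(box.x1, box.width, box.height)  # 30 30 40
```

Under this design, reading x0 directly is harmless, while assigning to it directly would leave width and bbox stale, which is one reason to route writes through the setter.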