
pdfminer - extract text behind LTFigure object


I am extracting text from PDF files using the Python pdfminer library (see docs).

However, pdfminer seems unable to extract all of the text in some files and returns an LTFigure object instead. Judging from the position of this object, it "covers" some of the text, and so that text is not extracted.

Both the PDF file and a short Jupyter notebook with the extraction code are in a GitHub repository I created specifically for this question:

https://github.com/druskacik/ltfigure-pdfminer

I am not an expert on how PDF files work, but common sense tells me that if I can find the text using Ctrl+F in a browser, it should be extractable.

I have considered using another library, but I also need the positions of the extracted words (to use them in my machine learning model), which is functionality only pdfminer seems to provide.


Solution

  • OK, I finally found a solution, and it is very simple: you can iterate over an LTFigure object the same way you would iterate over, e.g., an LTTextBox object.

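    from pdfminer.layout import LTChar, LTFigure, LTTextBox, LTTextLine

    # `interpreter`, `device` and `page` are assumed to be set up beforehand
    interpreter.process_page(page)
    layout = device.get_result()

    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            # Regular text: a text box contains text lines
            for element in lobj:
                if isinstance(element, LTTextLine):
                    text = element.get_text()
                    print(text)

        elif isinstance(lobj, LTFigure):
            # Text inside a figure: here the figure holds individual characters
            for element in lobj:
                if isinstance(element, LTChar):
                    text = element.get_text()
                    print(text)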
    
    

    Note that the correct way (to make sure the parser reads everything in the document) is to iterate over the pdfminer objects recursively, as shown here: How does one obtain the location of text in a PDF with PDFMiner?
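
For reference, here is a minimal sketch of what such a recursive traversal could look like. It is not the code from the linked answer. The helper name `iter_text_items` is made up for illustration, and it relies on duck typing rather than on pdfminer's classes: it assumes that container objects (pages, figures, text boxes, text lines) are iterable, and that leaf text objects such as LTChar expose `get_text()` and usually a `bbox` attribute.

```python
from collections.abc import Iterable


def iter_text_items(obj):
    """Recursively yield (text, bbox) for every leaf text object under obj.

    Hypothetical helper: it assumes containers are iterable and that leaf
    text objects have get_text(); bbox is None when the leaf has no position.
    """
    if isinstance(obj, Iterable):
        # Container (page, figure, text box, text line, ...): descend into it
        for child in obj:
            yield from iter_text_items(child)
    elif hasattr(obj, "get_text"):
        # Leaf text object: report its text together with its position, if any
        yield obj.get_text(), getattr(obj, "bbox", None)
    # Anything else (lines, rectangles, images, ...) carries no text and is skipped
```

With the `layout` object from the snippet above, usage would be something like `for text, bbox in iter_text_items(layout): print(text, bbox)`. Because the function descends all the way to the leaves, the output is per character, so words and their positions still have to be assembled from the individual characters and their bounding boxes.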