I am extracting text from PDF files using the Python pdfminer library (see docs).
However, pdfminer seems unable to extract all of the text in some files and returns an LTFigure object instead. Judging from the position of this object, it "covers" some of the text, and that text is therefore not extracted.
Both the PDF file and a short Jupyter notebook with the extraction code are in a GitHub repository I created specifically for this question:
https://github.com/druskacik/ltfigure-pdfminer
I am not an expert on how PDF files work, but common sense tells me that if I can search for the text using Ctrl+F in a browser, it should be extractable.
I have considered using another library, but I also need the positions of the extracted words (to use them in my machine learning model), and pdfminer seems to be the only library that provides this.
OK, so I finally came up with a solution. It's very simple: you can iterate over an LTFigure object in the same way you would iterate over, e.g., an LTTextBox object.
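    from pdfminer.layout import LTChar, LTFigure, LTTextBox, LTTextLine

    # interpreter, device and page come from the usual pdfminer setup
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            # Text boxes contain text lines
            for element in lobj:
                if isinstance(element, LTTextLine):
                    text = element.get_text()
                    print(text)
        elif isinstance(lobj, LTFigure):
            # Figures can contain individual characters
            for element in lobj:
                if isinstance(element, LTChar):
                    text = element.get_text()
                    print(text)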
Note that the correct way (to make sure that the parser reads everything in the document) would be to iterate over the pdfminer objects recursively, as shown here: How does one obtain the location of text in a PDF with PDFMiner?
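For illustration, here is a minimal sketch of what such a recursive walk might look like. It is not taken from the linked answer. The helper name iter_chars is made up, and the code is duck-typed rather than tied to specific classes: it assumes that pdfminer containers (pages, text boxes, text lines, figures) are iterable and that character objects such as LTChar expose get_text() and a bbox attribute. Objects with text but no position (such as LTAnno) are skipped.

```python
from collections.abc import Iterable


def iter_chars(obj):
    """Yield (text, bbox) for every positioned text leaf found below obj.

    Hypothetical helper: relies only on duck typing, so it descends into
    anything iterable, however deeply figures and boxes are nested.
    """
    if isinstance(obj, Iterable):
        # Container (page, text box, text line, figure, ...): descend.
        for child in obj:
            yield from iter_chars(child)
    elif hasattr(obj, "get_text") and hasattr(obj, "bbox"):
        # Leaf carrying both text and a position, e.g. a single character.
        yield obj.get_text(), obj.bbox


# Assumed usage with the layout object from the snippet above:
# for text, bbox in iter_chars(layout):
#     print(text, bbox)
```

Because this yields single characters with their bounding boxes, grouping them into words (and computing word positions) would still have to be done on top of it.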