PDFMiner Extraction for Single Words - LTText LTTextBox


I am generating word x,y coordinates with PDFMiner in the example below, but the results come back line by line. How can I split the output into individual words, rather than groups of words per line (see the example below)? I have tried several of the arguments in the PDFMiner tutorial, as well as both LTTextBox and LTText. Moreover, I cannot use the beginning and end offsets normally used in text analytics.

This PDF is a good example; it is used in the code below.

http://www.africau.edu/images/default/sample.pdf

from pdfminer.layout import LAParams, LTTextBox, LTText
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

# Open a searchable PDF and print x,y coordinates
# (raw string so the backslash in the Windows path is not read as an escape)
fp = open(r'C:\sample.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('--- Processing ---')
    interpreter.process_page(page)
    layout = dev.get_result()
    for lobj in layout:
        if isinstance(lobj, LTText):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))

This returns the x,y coordinates for the searchable PDF as demonstrated below:

--- Processing ---
At (57.375, 747.903) is text: A Simple PDF File
At (69.25, 698.098) is text: This is a small demonstration .pdf file -
At (69.25, 674.194) is text: just for use in the Virtual Mechanics tutorials. More text. And more 
 text. And more text. And more text. And more text.

Desired result (the coordinates are placeholders for demonstration):

--- Processing ---
At (57.375, 747.903) is text: A
At (69.25, 698.098) is text: Simple
At (69.25, 674.194) is text: PDF
At (69.25, 638.338) is text: File

Solution

  • With PDFMiner, after going through each line (as you already did), the only finer level you can iterate over is each character in the line.

    I did this in the code below: it records the x, y of the first character of each word and splits the words at every LTAnno (e.g. \n) or whenever .get_text() == ' ' (a space).

    from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.converter import PDFPageAggregator
    
    # Open a searchable PDF and print x,y coordinates
    # (raw string so the backslash in the Windows path is not read as an escape)
    fp = open(r'C:\sample.pdf', 'rb')
    manager = PDFResourceManager()
    laparams = LAParams()
    dev = PDFPageAggregator(manager, laparams=laparams)
    interpreter = PDFPageInterpreter(manager, dev)
    pages = PDFPage.get_pages(fp)
    
    for page in pages:
        print('--- Processing ---')
        interpreter.process_page(page)
        layout = dev.get_result()
        x, y, text = -1, -1, ''
        for textbox in layout:
            # Only text boxes contain lines that can be walked character by character
            if isinstance(textbox, LTTextBox):
                for line in textbox:
                    for char in line:
                        # A line break (LTAnno) or a space marks the end of a word
                        if isinstance(char, LTAnno) or char.get_text() == ' ':
                            if x != -1:
                                print('At %r is text: %s' % ((x, y), text))
                            x, y, text = -1, -1, ''
                        elif isinstance(char, LTChar):
                            text += char.get_text()
                            # Record the position of the word's first character
                            if x == -1:
                                x, y = char.bbox[0], char.bbox[3]
        # If the page did not end with a space or an LTAnno, print the last word here
        if x != -1:
            print('At %r is text: %s' % ((x, y), text))
    

    The output looks as follows:

    At (64.881, 747.903) is text: A
    At (90.396, 747.903) is text: Simple
    At (180.414, 747.903) is text: PDF
    At (241.92, 747.903) is text: File
    

    You can refine the word-detection conditions to suit your requirements, e.g. strip punctuation marks (.!?) from the end of words.
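
As a rough illustration of the punctuation idea, here is a minimal sketch that post-processes the extracted words. It assumes the words were collected into a list of ((x, y), text) tuples instead of being printed. The helper names `strip_trailing_punctuation` and `clean_words`, and the default set of punctuation characters, are hypothetical and not part of PDFMiner.

```python
def strip_trailing_punctuation(word, chars=".!?,;:"):
    """Return the word without trailing punctuation.

    The default character set is an assumption; adjust it as needed.
    """
    return word.rstrip(chars)


def clean_words(words):
    """Strip trailing punctuation from ((x, y), text) tuples.

    Entries that consist only of punctuation become empty and are dropped.
    The position of the first character is kept unchanged.
    """
    cleaned = []
    for position, text in words:
        stripped = strip_trailing_punctuation(text)
        if stripped:
            cleaned.append((position, stripped))
    return cleaned


# Example with made-up coordinates
words = [((69.25, 698.1), "file."), ((80.0, 698.1), "more?!"), ((90.0, 698.1), "...")]
for position, text in clean_words(words):
    print('At %r is text: %s' % (position, text))
```

Only trailing characters are removed, so the recorded x, y of the first character stays valid. Stripping leading punctuation as well would require moving the recorded position to the first remaining character.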