Tags: python, python-3.x, pdf, figure

How to find figure captions in a PDF?


I want to develop a Python script that can find all of the figure captions within a PDF. Is it possible to collect the captions and append each one to a list as the script finds it?

I have tried searching for the word "Figure" and grabbing the sentence that contains it, but this does not work well: a caption can span several sentences, and this approach captures only the text up to the first period.

EDIT The following is a sample PDF that I intend to work with. As you can see, the label Fig.1 is written right below the image.

NEW EDIT Here is the full HTML file that was converted with pdf2htmlEX: https://drive.google.com/open?id=1hYriVrTlwmxR35A2Jy7mKoO4ns2oWe3Z


Solution

  • This answer is not complete; I will update it as we work through the problem.

    Copy of original PDF:

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC335638/pdf/pnas00677-0355.pdf

    Step 1 - Try PyPDF2

    # PyPDF2 >= 3.0 API; older releases used PdfFileReader, numPages,
    # getPage() and extractText() instead.
    import PyPDF2
    
    # Open the PDF in binary mode; the with-block closes it automatically.
    with open('example.pdf', 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
    
        # Print the number of pages in the PDF.
        print(len(reader.pages))
    
        # Extract and print the text of the first page.
        first_page = reader.pages[0]
        print(first_page.extract_text())
    

    This wasn't suitable as even the words weren't separated by spaces.

    Step 2 - Try pdf2htmlEX

    The next suggestion was to convert the PDF to HTML with https://github.com/coolwanglu/pdf2htmlEX and then develop appropriate selectors to use with beautifulsoup4.

    pdf2htmlEX produced HTML in which every single word was wrapped in its own tags, so there was no block structure to select on and it didn't help at all.

    Step 3 - Try pdfminer.six

    https://github.com/pdfminer/pdfminer.six

    This is much better, though still not perfect:

    CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT

    BY JOHN C. ECCLES

    AMA/ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO

    Communicated May 16, 1967

    Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its Presumably, it is for this reason that there is the unique neuronal constituents. most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajall have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.2

    As shown in Figure 1,3 there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (mf); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (cn) and otherwise largely in Deiters' nucleus. The climbing fiber is uniquely distributed to a single

    FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the cerebellar cortex. The principal

    components are shown in diagrammatic form, and are described in the text.

    336

    VOL. 58, 1967

    PHYSIOLOGY: J. C. ECCLES

    337

    We can then run this code on the output:

    import re
    
    # Read in the text produced by pdfminer.six.
    file_name = "sample.txt"
    with open(file_name, "r") as pdf_text_file:
        pdf_text = pdf_text_file.read()
    
    # Split the text into blocks separated by a blank line.
    blocks = pdf_text.split("\n\n")
    
    # Collapse the arbitrary line breaks inside each block into single
    # spaces so that words on adjacent lines are not run together.
    blocks = [" ".join(block.split()) for block in blocks]
    
    # Keep the blocks that start with "fig" (case-insensitive) as captions.
    captions = [block for block in blocks if re.match("fig", block, re.IGNORECASE)]
    
    # Done!
    for caption in captions:
        print(caption)
        print()
    

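    This may need some more tweaking, as the output of pdfminer.six is not quite perfect: in the sample above, the caption for Fig. 1 is split across two blocks, so only its first block is captured.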
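One possible tweak, as a rough sketch rather than a finished solution: require a figure number after the "Fig" prefix, and treat a caption block that does not end in sentence punctuation as unfinished, appending the following block to it. The function name `extract_captions`, the prefix pattern, and the "ends with a period" rule are all assumptions made for illustration; the rule would wrongly absorb a page number or body text if a caption really did end without punctuation.

```python
import re

# Assumed caption prefix: "Fig", "Fig.", "FIG." or "Figure", then a number.
CAPTION_START = re.compile(r"^fig(?:ure)?\.?\s*\d+", re.IGNORECASE)
# A block ending with one of these is treated as a finished caption.
SENTENCE_END = (".", "!", "?")


def extract_captions(text):
    """Return figure captions, merging blocks that continue an unfinished one."""
    # Normalise whitespace inside each blank-line-separated block.
    blocks = [" ".join(b.split()) for b in text.split("\n\n")]
    blocks = [b for b in blocks if b]

    captions = []
    open_caption = False  # True while the last caption lacks end punctuation
    for block in blocks:
        if CAPTION_START.match(block):
            captions.append(block)
            open_caption = not block.endswith(SENTENCE_END)
        elif open_caption:
            # Continuation of the previous, unfinished caption.
            captions[-1] += " " + block
            open_caption = not captions[-1].endswith(SENTENCE_END)
        # Any other block is body text and is ignored.
    return captions


# Excerpt of the pdfminer.six output shown above.
sample = (
    "As shown in Figure 1,3 there are only two kinds of afferent fibers\n\n"
    "FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the "
    "cerebellar cortex. The principal\n\n"
    "components are shown in diagrammatic form, and are described in the text.\n\n"
    "336\n"
)

for caption in extract_captions(sample):
    print(caption)
```

On this excerpt the two halves of the Fig. 1 caption come back as a single entry, while the body paragraph that merely mentions "Figure 1" and the page number are skipped.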

    Step 4 - Try Tesseract

    I was curious to see how good OCR would be in this case. First convert the PDF pages to images (the code below expects a file named page_1.jpg), then install the following:

    sudo apt install tesseract-ocr
    pip install pyocr
    

    This code will perform OCR on the image.

    from PIL import Image
    import sys
    
    import pyocr
    import pyocr.builders
    
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    
    tool = tools[0]
    print("Will use tool '%s'" % (tool.get_name()))
    
    langs = tool.get_available_languages()
    print("Available languages: %s" % ", ".join(langs))
    lang = langs[0]
    print("Will use lang '%s'" % (lang))
    
    image_file = "page_1.jpg"
    
    txt = tool.image_to_string(
        Image.open(image_file),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    
    # Write the recognised text to a file; the with-block closes it.
    with open("page_1.txt", "w") as out_file:
        out_file.write(txt)
    

    This produces better blocks of text, but the OCR introduces a few typos (for example, "Fic." instead of "FIG." in the caption):

    CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT

    By Joun C. Eccuss

    AMA/ ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO

    Communicated May 16, 1967

    Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its unique neuronal constituents. Presumably, it is for this reason that there is the most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajal! have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.’

    As shown in Figure 1,* there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (m/f); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (en) and otherwise largely in Deiters’ nucleus. The climbing fiber is uniquely distributed to a single

    Fic. 1.—Perspective drawing by Fox? of a part of a folium of the cerebellar cortex. The principal components are shown in diagrammatic form, and are described in the text.

    336
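Because the OCR output spells the caption prefix as "Fic.", a plain case-insensitive match on "fig" would miss it. A minimal sketch of a more tolerant test follows; the set of confusable letters (i/l/1 and g/c/q) is an assumption based on this one sample, not a general model of OCR errors, and `is_ocr_caption` is a hypothetical helper name.

```python
import re

# Assumption: OCR may misread "i" as "l" or "1" and "g" as "c" or "q",
# so the second and third letters of "Fig" are matched loosely.
OCR_CAPTION_START = re.compile(r"^f[il1][gcq](?:ure)?\.?\s*\d+", re.IGNORECASE)


def is_ocr_caption(block):
    """Return True if a block looks like a (possibly misread) figure caption."""
    return bool(OCR_CAPTION_START.match(block.strip()))


print(is_ocr_caption("Fic. 1.—Perspective drawing by Fox? of a part of a folium"))  # True
print(is_ocr_caption("FIG. 1.-Perspective drawing by Fox3"))                        # True
print(is_ocr_caption("As shown in Figure 1,* there are only two kinds"))            # False
```

Loosening the pattern trades precision for recall, so it is worth checking the matches by eye on a few pages before relying on it.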