How to find figure captions in a PDF?

I want to develop a Python script that finds all of the figure captions within a PDF. Is it possible to collect the captions and append each one to a list as the script finds it?

I have tried searching for the word "Figure" and then grabbing the sentence that contains it, but this does not work well: a caption often spans several sentences, and this approach captures only the text up to the first period.

EDIT: The following is a sample of the PDF that I intend to work with. As you can see, the caption starting with "Fig. 1" is written right below the image.

NEW EDIT: Here is the full HTML file that was converted with pdf2htmlEX: https://drive.google.com/open?id=1hYriVrTlwmxR35A2Jy7mKoO4ns2oWe3Z

Solution

This answer is not complete; I will update it as we work through the problem.

Copy of original PDF:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC335638/pdf/pnas00677-0355.pdf

Step 1 - Try PyPDF2

# importing required modules
from PyPDF2 import PdfReader

# opening the pdf file in binary mode; the with-block closes it afterwards
with open('example.pdf', 'rb') as pdfFileObj:
    # creating a pdf reader object
    # (the older PdfFileReader/getPage/extractText names were dropped in PyPDF2 3.0)
    pdfReader = PdfReader(pdfFileObj)

    # printing number of pages in pdf file
    print(len(pdfReader.pages))

    # creating a page object
    pageObj = pdfReader.pages[0]

    # extracting text from page
    print(pageObj.extract_text())

This wasn't suitable as even the words weren't separated by spaces.

Step 2 - Try pdf2htmlEX

I suggested trying https://github.com/coolwanglu/pdf2htmlEX to convert the PDF to HTML and then developing appropriate selectors to use with beautifulsoup4.

pdf2htmlEX produced HTML in which every single word was wrapped in its own tags, so it didn't help us at all.

Step 3 - Try pdfminer.six

https://github.com/pdfminer/pdfminer.six

This is much better, though still not perfect:

CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT

BY JOHN C. ECCLES

AMA/ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO

Communicated May 16, 1967

Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its Presumably, it is for this reason that there is the unique neuronal constituents. most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajall have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.2

As shown in Figure 1,3 there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (mf); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (cn) and otherwise largely in Deiters' nucleus. The climbing fiber is uniquely distributed to a single

FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the cerebellar cortex. The principal

components are shown in diagrammatic form, and are described in the text.

336

VOL. 58, 1967

PHYSIOLOGY: J. C. ECCLES

337

We can then run this code on the output:

import re

# Read in the text; the with-block closes the file afterwards
fileName = "sample.txt"
with open(fileName, "r") as pdfTextfile:
    pdfText = pdfTextfile.read()

# Split text into blocks separated by double line break.
blocks = pdfText.split("\n\n")

# Replace new lines within blocks with spaces to undo arbitrary line breaks
# without gluing the neighbouring words together
blocks = [block.replace("\n", " ").strip() for block in blocks]

# Which blocks are figure captions?
captions = []
for block in blocks:
    if re.match(r'fig', block, re.IGNORECASE):
        captions.append(block)

# Done!
for caption in captions:
    print(caption)
    print()

This may need some more tweaking, as the output of pdfminer.six is not quite perfect.
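One possible tweak, as a rough sketch only: in the sample output above, the caption for Figure 1 is split across two blocks, and a bare "fig" prefix would also match any paragraph that happens to start with that word. The version below assumes that a caption starts with "Fig" or "Figure" followed by a number, and that a caption block not ending in sentence punctuation continues in the next block. Both rules are assumptions drawn from this one sample, and the names find_captions, CAPTION_START and SENTENCE_END are made up for illustration.

```python
import re

# Assumed caption shape: "Fig" or "Figure", optional period, then a number,
# for example "FIG. 1.-" in the pdfminer.six output.
CAPTION_START = re.compile(r"^fig(?:ure)?\.?\s*\d+", re.IGNORECASE)
# Assumed sign of a finished caption: the block ends with sentence punctuation.
SENTENCE_END = re.compile(r"[.!?]$")


def find_captions(text):
    """Return figure captions, joining a caption that was split across blocks."""
    blocks = [b.replace("\n", " ").strip() for b in text.split("\n\n")]
    blocks = [b for b in blocks if b]
    captions = []
    i = 0
    while i < len(blocks):
        block = blocks[i]
        if CAPTION_START.match(block):
            # Keep absorbing following blocks while the caption looks unfinished.
            while not SENTENCE_END.search(block) and i + 1 < len(blocks):
                i += 1
                block += " " + blocks[i]
            captions.append(block)
        i += 1
    return captions


sample = (
    "otherwise largely in Deiters' nucleus.\n\n"
    "FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the "
    "cerebellar cortex. The principal\n\n"
    "components are shown in diagrammatic form, and are described in the text.\n\n"
    "336"
)
for caption in find_captions(sample):
    print(caption)
```

A caption that happens to be split right after a complete sentence would still be cut short, and body text starting with "Figure 2 shows" would be picked up as a caption, so the results are worth checking by eye.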

Step 4 - Try Tesseract

I was curious to see how good OCR would be in this case. First convert the PDF to images, one per page (the code below reads page_1.jpg). Then install the following:

sudo apt install tesseract-ocr
pip install pyocr

This code will perform OCR on the image.

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)

tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))

imageFile = "page_1.jpg"

txt = tool.image_to_string(
    Image.open(imageFile),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
# Save the recognized text; the with-block closes the file afterwards
with open("page_1.txt", "w") as outFile:
    outFile.write(txt)

This produces better blocks of text, but has a few typos:

CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT

By Joun C. Eccuss

AMA/ ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO

Communicated May 16, 1967

Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its unique neuronal constituents. Presumably, it is for this reason that there is the most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajal! have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.’

As shown in Figure 1,* there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (m/f); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (en) and otherwise largely in Deiters’ nucleus. The climbing fiber is uniquely distributed to a single

Fic. 1.—Perspective drawing by Fox? of a part of a folium of the cerebellar cortex. The principal components are shown in diagrammatic form, and are described in the text.

336
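Note that the OCR output reads "FIG." as "Fic.", so a plain search for "fig" at the start of a block would miss this caption. A minimal sketch of a more forgiving check follows, assuming the only misreadings are "i" read as "l", "1" or "!" and "g" read as "c" or "q". Those character classes are a guess based on this single page, and the names OCR_CAPTION_START and is_caption are made up for illustration.

```python
import re

# Hypothetical tolerance for OCR confusions: "i" may come out as "l", "1" or "!",
# and "g" as "c" or "q". Extend these classes as new misreadings turn up.
OCR_CAPTION_START = re.compile(
    r"^f[il1!][gcq](?:ure)?[.,]?\s*\d+",
    re.IGNORECASE,
)


def is_caption(block):
    """Return True if the block looks like a figure caption, OCR typos allowed."""
    return OCR_CAPTION_START.match(block.strip()) is not None


ocr_blocks = [
    "Communicated May 16, 1967",
    "As shown in Figure 1,* there are only two kinds of afferent fibers",
    "Fic. 1.\u2014Perspective drawing by Fox? of a part of a folium of the cerebellar cortex.",
]
captions = [block for block in ocr_blocks if is_caption(block)]
for caption in captions:
    print(caption)
```

The looser the pattern, the more likely it is to match ordinary text, so it may be worth keeping the requirement that a number follows the prefix.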