I want to develop a Python script that finds all of the figure captions within a PDF. Is it possible to collect the captions and append each one to a list as the script finds it?
I have tried searching for the word "Figure" and grabbing the sentence that contains it, but this is unreliable: it captures only the text up to the first period instead of every sentence in the caption.
EDIT: The following is a sample of the PDFs that I intend to work with. As you can see, the label "Fig.1" is written directly below the image.
NEW EDIT: Here is the full HTML file produced by converting the PDF with pdf2htmlEX: https://drive.google.com/open?id=1hYriVrTlwmxR35A2Jy7mKoO4ns2oWe3Z
This answer is not complete; I will update it as we work through the problem.
Copy of original PDF:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC335638/pdf/pnas00677-0355.pdf
Step 1 - Try PyPDF2
# importing required modules
# (PyPDF2 3.x names; older releases used PdfFileReader, numPages, getPage and extractText)
from PyPDF2 import PdfReader

# opening the pdf file in binary mode; the with-block closes it afterwards
with open('example.pdf', 'rb') as pdfFileObj:
    # creating a pdf reader object
    pdfReader = PdfReader(pdfFileObj)
    # printing number of pages in pdf file
    print(len(pdfReader.pages))
    # creating a page object
    pageObj = pdfReader.pages[0]
    # extracting text from page
    print(pageObj.extract_text())
This wasn't suitable: the extracted text didn't even have spaces between the words.
Step 2 - Try pdf2htmlEX
The next suggestion was to convert the PDF to HTML with https://github.com/coolwanglu/pdf2htmlEX and then develop appropriate selectors to use with beautifulsoup4.
pdf2htmlEX produced HTML in which every single word was wrapped in its own tags, so this didn't help us at all.
Step 3 - Try pdfminer.six
https://github.com/pdfminer/pdfminer.six
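The extraction command itself is not shown here. A minimal sketch of one way to do it, assuming pdfminer.six is installed (pip install pdfminer.six) and using its extract_text helper from pdfminer.high_level. The function name pdf_to_text_file and the file names are placeholders, not part of the original answer:

```python
def pdf_to_text_file(pdf_path, txt_path):
    """Extract the text of a PDF with pdfminer.six and save it to a text file."""
    # Imported here so the dependency is only needed when the function is called.
    from pdfminer.high_level import extract_text

    # extract_text returns the text of the whole document as one string.
    text = extract_text(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

Calling pdf_to_text_file("example.pdf", "sample.txt") would write the extracted text to sample.txt, the file name that the caption script further down reads.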
This is much better, though still not perfect:
CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT
BY JOHN C. ECCLES
AMA/ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO
Communicated May 16, 1967
Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its Presumably, it is for this reason that there is the unique neuronal constituents. most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajall have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.2
As shown in Figure 1,3 there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (mf); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (cn) and otherwise largely in Deiters' nucleus. The climbing fiber is uniquely distributed to a single
FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the cerebellar cortex. The principal
components are shown in diagrammatic form, and are described in the text.
336
VOL. 58, 1967
PHYSIOLOGY: J. C. ECCLES
337
We can then run this code on the output:
import re

# Read in the extracted text
fileName = "sample.txt"
with open(fileName, "r") as pdfTextfile:
    pdfText = pdfTextfile.read()

# Split text into blocks separated by double line break.
blocks = pdfText.split("\n\n")

# Replace new lines within blocks with spaces to undo arbitrary line breaks
# (an empty replacement would fuse the words on either side of each break)
blocks = [block.replace("\n", " ") for block in blocks]

# Which blocks are figure captions?
captions = []
for block in blocks:
    if re.search('^fig', block, re.IGNORECASE):
        captions.append(block)

# Done!
for caption in captions:
    print(caption)
    print()
This may need some more tweaking, as the output of pdfminer.six is not quite perfect.
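One possible tweak, sketched as an assumption rather than as a tested part of this answer: a plain '^fig' test also matches body paragraphs that happen to begin with the word "Figure". Requiring a figure number followed by a separator narrows the match. The name find_captions and the pattern are illustrative, the last sentence of the sample text is made up for the demonstration, and captions written without a separator after the number (for example "Fig.1 Some text") would need a looser pattern:

```python
import re

# Assumed caption shape: "Fig" or "Figure", an optional period, a number, then
# a separator such as ".", ":", "-" or a dash. Adjust for other caption styles.
CAPTION_RE = re.compile(
    r"^\s*fig(?:ure)?\.?\s*\d+\s*[.:\-\u2013\u2014]", re.IGNORECASE
)


def find_captions(text):
    """Return the blank-line-separated blocks of text that look like captions."""
    captions = []
    for block in text.split("\n\n"):
        # Collapse line breaks and repeated spaces inside the block.
        block = " ".join(block.split())
        if CAPTION_RE.match(block):
            captions.append(block)
    return captions


sample = (
    "As shown in Figure 1,3 there are only two kinds of afferent fibers\n\n"
    "FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the "
    "cerebellar cortex. The principal\n"
    "components are shown in diagrammatic form, and are described in the text.\n\n"
    "Figure 1 shows the cortex in outline."  # made-up body sentence
)

for caption in find_captions(sample):
    print(caption)
```

Only the block starting with "FIG. 1.-" is kept; the two body paragraphs are skipped because the first does not start with the label and the second has no separator after the number.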
Step 4 - Try Tesseract
I was curious to see how well OCR would work in this case. First convert the PDF to images. Then install the following:
sudo apt install tesseract-ocr
pip install pyocr
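The PDF-to-image conversion mentioned above is not shown. One way to do it, as a sketch only: the third-party pdf2image package (pip install pdf2image, which also needs the poppler utilities installed) provides convert_from_path, which returns one PIL image per page. The helper names and the 300 dpi setting are assumptions; the page_N.jpg naming mirrors the page_1.jpg file name used in the OCR code.

```python
def page_image_name(page_number):
    """File name for a 1-based page number, e.g. page_1.jpg."""
    return "page_%d.jpg" % page_number


def pdf_to_page_images(pdf_path, dpi=300):
    """Render every page of a PDF to a JPEG file and return the file names."""
    # Imported here so pdf2image (and poppler) are only needed when called.
    from pdf2image import convert_from_path

    names = []
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    for number, page in enumerate(pages, start=1):
        name = page_image_name(number)
        page.save(name, "JPEG")
        names.append(name)
    return names
```

Calling pdf_to_page_images("example.pdf") would write page_1.jpg, page_2.jpg and so on into the current directory.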
This code will perform OCR on the image.
from PIL import Image
import sys
import pyocr
import pyocr.builders

# Find the OCR tools that pyocr can use (e.g. Tesseract)
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))

# Run OCR on one page image and save the recognised text
imageFile = "page_1.jpg"
txt = tool.image_to_string(
    Image.open(imageFile),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
with open("page_1.txt", "w") as outFile:
    outFile.write(txt)
This produces better blocks of text, but has a few typos:
CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT
By Joun C. Eccuss
AMA/ ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO
Communicated May 16, 1967
Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its unique neuronal constituents. Presumably, it is for this reason that there is the most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajal! have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.’
As shown in Figure 1,* there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (m/f); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (en) and otherwise largely in Deiters’ nucleus. The climbing fiber is uniquely distributed to a single
Fic. 1.—Perspective drawing by Fox? of a part of a folium of the cerebellar cortex. The principal components are shown in diagrammatic form, and are described in the text.
336
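Note that the OCR output spells the caption label "Fic.", so the '^fig' test used in Step 3 would miss it. A hedged sketch of a more forgiving check, assuming the only misreads worth tolerating are "c" for "g" and "l" or "1" for "i" (a guess based on this single sample, not a general rule); looks_like_caption is an illustrative name:

```python
import re

# "F", then i/l/1 (likely OCR confusions for "i"), then g/c, an optional
# "ure", an optional period, and a figure number.
OCR_CAPTION_RE = re.compile(r"^\s*f[il1][gc](?:ure)?\.?\s*\d+", re.IGNORECASE)


def looks_like_caption(block):
    """True if a text block starts like a figure caption, allowing OCR typos."""
    return OCR_CAPTION_RE.match(block) is not None


print(looks_like_caption("Fic. 1.—Perspective drawing by Fox? of a part of a folium"))
print(looks_like_caption("FIG. 1.-Perspective drawing by Fox3 of a part of a folium"))
print(looks_like_caption("As shown in Figure 1,* there are only two kinds"))
```

The first two calls print True and the third prints False. A looser pattern accepts more false positives, so it is worth checking the matches by eye on a few pages.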