python pdf ocr image-recognition pdf-parsing

Identify and extract specific sections of a PDF document


I have several exams in PDF format. I want to programmatically extract each question as a separate image/document. OCR is not ideal because it does not preserve code/equation formatting well. The end goal is to make flash cards, with each card containing an image of an entire question. Several questions can be on the same page, and questions can also be multi-part (e.g. 1a, 2f, etc.).

Currently, I'm considering using OCR to extract question tags (e.g. 1, 2, 3, etc.), then finding their positions in the PDF and extracting an image that runs from the start of one question to the start of the next. Is there any framework or software that can do this, or an alternative approach that would make this easier?
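A minimal sketch of the tag-to-tag cropping idea described above, for a single page. It assumes the word positions are already available as dicts with "text", "x0" and "top" keys (modeled on what pdfplumber's `extract_words()` returns), that top-level tags look like "1." or "2)", and that they sit near the left margin. The function name `question_boxes` and the `max_indent` and `pad` parameters are hypothetical, chosen only for illustration.

```python
import re

# Assumed tag format: a number followed by "." or ")", e.g. "1.", "2)", "12."
QUESTION_TAG = re.compile(r"^(\d+)[.)]$")


def question_boxes(words, page_width, page_height, max_indent=80.0, pad=4.0):
    """Return (label, (x0, top, x1, bottom)) crop boxes, one per question.

    `words` is a list of dicts with "text", "x0" and "top" keys, in PDF
    points, with `top` measured downward from the top of the page.
    """
    # Keep only tokens that look like a question tag and start near the left
    # margin, so numbers inside a question body are not mistaken for tags.
    starts = []
    for w in words:
        m = QUESTION_TAG.match(w["text"])
        if m and w["x0"] <= max_indent:
            starts.append((w["top"], m.group(1)))
    starts.sort()

    boxes = []
    for i, (top, label) in enumerate(starts):
        # A question ends where the next one starts, or at the page bottom.
        if i + 1 < len(starts):
            bottom = starts[i + 1][0] - pad
        else:
            bottom = page_height
        boxes.append((label, (0.0, max(top - pad, 0.0), page_width, bottom)))
    return boxes


# Hypothetical word positions for one US Letter page (612 x 792 points).
sample_words = [
    {"text": "1.", "x0": 50, "top": 100},
    {"text": "Compute", "x0": 70, "top": 100},
    {"text": "2", "x0": 200, "top": 120},   # no "." or ")", so not a tag
    {"text": "3.", "x0": 300, "top": 130},  # too far from the left margin
    {"text": "2.", "x0": 50, "top": 400},
]
sample_boxes = question_boxes(sample_words, page_width=612, page_height=792)
print(sample_boxes)
```

Sub-parts such as "(a)" do not match the tag pattern, so they stay inside the parent question's box. If the PDF has a text layer, the words could come from pdfplumber's `page.extract_words()`, and each box could then be rendered with `page.crop(box).to_image(resolution=150).save(...)`, which keeps code and equations as an image rather than passing them through OCR. Questions that continue onto the next page would need extra handling that this sketch does not attempt.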


Solution

  • Have a look at Science-Parse by Allen AI. It does a pretty decent job of extracting metadata from PDF documents. It is often better than other text-extraction tools such as textract and pdfplumber.

    Accurately extracting mathematical formulae from PDFs has been a research topic for many years. I have not found any open-source projects or packages that extract mathematical formulae precisely, although there are a number of research papers that describe methods for doing so, such as this and this. (More research has been done on recognizing mathematical formulae or converting them to proper markup such as LaTeX, MathML, etc.) Most of these papers use information about the font, baseline, glyph bounding boxes, line spacing, etc. to correctly recognize mathematical formulae and extract them.

    For OCR, you can always use Infty. This is what the description for InftyReader says:

    InftyReader recognizes scanned images of printed scientific documents including Math formulae, and outputs the recognition results in various formats: XML format for InftyEditor, LaTeX, MathML, Human-Readable TeX for the blinds, etc.