
Extracting structural data from ODP or ODF files


I'm trying to extract the information hierarchy from ODP (OpenDocument Presentation) files: titles, subtitles, body text, and so on.

Do you know of any tool or technique that would do the job?

Otherwise, is there a way to parse these ODP documents to extract styling information, so that I can later deduce the document structure from its styling?

I'm afraid the structure of the XML inside the ODP file may depend on the software or version that produced it, so I'd rather find a high-level solution than parse the XML directly.


Solution

  • As I couldn't find any tool that extracts the outline, titles, text, and so on from presentation files, I created Exide, an open source API supporting ODP, PPTX and Beamer files. It provides:

    • Slide title extraction
    • Slide body text extraction
    • Named-entity recognition (inaccurate)
    • Emphasized text recognition
    • URL recognition
    • Structure detection and outline generation
    • Recognition of the following slide types:
      • Introduction
      • Conclusion
      • Definition
      • Example
      • Table of contents
      • References
      • Section header

    For more information, check out the GitHub page of the project.
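
For the lower-level route mentioned in the question, here is a minimal Python sketch. It is not how Exide works; it only relies on two things defined by the OpenDocument format: an ODP file is a ZIP archive containing content.xml, and placeholder frames carry a presentation:class attribute such as "title", "subtitle" or "outline". The function name extract_slides and the shape of the returned dicts are my own choices for illustration. It only looks at frames that are direct children of a slide, so grouped shapes and speaker notes are skipped, and frames without a recognised class end up under "other".

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
NS = {
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "presentation": "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0",
}
PAGE_TAG = "{%s}page" % NS["draw"]
PARAGRAPH_TAG = "{%s}p" % NS["text"]
NAME_ATTR = "{%s}name" % NS["draw"]
CLASS_ATTR = "{%s}class" % NS["presentation"]

# Placeholder classes treated as structural; anything else goes to "other".
KNOWN_CLASSES = ("title", "subtitle", "outline")


def extract_slides(odp_file):
    """Return one dict per slide, with paragraph text grouped by placeholder class.

    odp_file may be a path or a binary file-like object (illustrative helper,
    not part of any library).
    """
    with zipfile.ZipFile(odp_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    slides = []
    for page in root.iter(PAGE_TAG):
        slide = {"name": page.get(NAME_ATTR), "other": []}
        for cls in KNOWN_CLASSES:
            slide[cls] = []
        # Direct child frames only: this skips frames nested in notes or groups.
        for frame in page.findall("draw:frame", NS):
            cls = frame.get(CLASS_ATTR)
            key = cls if cls in KNOWN_CLASSES else "other"
            for paragraph in frame.iter(PARAGRAPH_TAG):
                # itertext() joins text split across nested text:span elements.
                text = "".join(paragraph.itertext()).strip()
                if text:
                    slide[key].append(text)
        slides.append(slide)
    return slides
```

Calling extract_slides("presentation.odp") (a hypothetical file name) gives a list you can walk to rebuild an outline from each slide's "title" and "outline" entries. Files whose authors typed titles into plain text boxes instead of placeholders will not have the class attribute, which is where styling-based heuristics would have to take over.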