parsing structure extract data-extraction opendocument

Extracting structural data from ODP or ODF files

I'm trying to extract the information hierarchy from ODP (OpenDocument Presentation) files: titles, subtitles, body text, and so on.

Do you know of any tool or technique that would do the job?

Otherwise, is there a way to parse ODP documents to extract styling information, so that I can later deduce the document structure from its styling?

I'm afraid the structure of the XML file inside the ODP file could depend on the software or version that produced it, so I'd rather find a high-level solution than parse this XML file directly.

Solution

As I couldn't find any tool that extracts the outline, titles, and text from presentation files, I created Exide, an open-source API supporting ODP, PPTX and Beamer files. It provides:

Slide title extraction
Slide body text extraction
Named-entity recognition (inaccurate)
Emphasized text recognition
URL recognition
Structure detection and outline generation
Recognition of the following slide types:
- Introduction
- Conclusion
- Definition
- Example
- Table of contents
- References
- Section header

For more information, check out the GitHub page of the project.
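
For readers who want to see what the low-level route looks like for ODP, here is a minimal sketch using only the Python standard library. It is not Exide's code and does not use its API. It assumes the slides live in `content.xml` inside the ODP archive and that the authoring tool tagged placeholder frames with the ODF `presentation:class` attribute (`title`, `subtitle`, `outline`). That tagging is exactly the part that may vary between tools and versions, so frames without a recognized class are simply treated as body text. The function name `extract_slides` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF namespaces used in content.xml
NS = {
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "presentation": "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
CLASS_ATTR = "{%s}class" % NS["presentation"]
PAGE_TAG = "{%s}page" % NS["draw"]
PARA_TAG = "{%s}p" % NS["text"]


def extract_slides(odp_file):
    """Return one dict per slide with its title, subtitle and body paragraphs.

    Hypothetical helper: relies on presentation:class being set on frames,
    which is an assumption about how the file was authored.
    """
    with zipfile.ZipFile(odp_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    slides = []
    for page in root.iter(PAGE_TAG):
        slide = {"title": None, "subtitle": None, "body": []}
        # Only direct child frames are inspected; grouped shapes are ignored.
        for frame in page.findall("draw:frame", NS):
            role = frame.get(CLASS_ATTR)
            paragraphs = [
                "".join(p.itertext()).strip() for p in frame.iter(PARA_TAG)
            ]
            paragraphs = [p for p in paragraphs if p]
            if not paragraphs:
                continue
            if role == "title":
                slide["title"] = " ".join(paragraphs)
            elif role == "subtitle":
                slide["subtitle"] = " ".join(paragraphs)
            else:
                # "outline" frames and untagged text boxes both end up here
                slide["body"].extend(paragraphs)
        slides.append(slide)
    return slides
```

Calling `extract_slides("talk.odp")` (the file name is a placeholder) returns a list of dictionaries, one per slide. Anything beyond this, such as outline levels, emphasis, or slide-type detection, needs extra heuristics on top of the raw text.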