
Looking for a library for parsing and extracting objects from ppt, pptx, doc, docx files


I am looking for a library that can open ppt, pptx, doc, and docx files, parse them, and extract all objects from them.

For example, for a ppt file it should extract all objects (images, text, tables, autoshapes, etc.) and give me each object's location and size, plus formatting such as font size, color, and bold. For images, it should be able to save each one to a jpg file. The library should also be able to take a snapshot of a whole slide.

I have tried Aspose for this, but it wasn't accurate: it doesn't extract all properties, and its export to image isn't accurate either. Are there any ideas for using the OpenOffice library to do this?

I am open to using either a Java or a C++ library.


Solution

  • At work we used the OpenOffice Java API to extract the images from ppt/pptx files. I used the docs from here. I am pretty sure you can use the info in that guide to do what you need.

    Good luck.
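
For the image part only, here is a minimal sketch of a different, simpler technique than the OpenOffice API mentioned above. It relies on the fact that pptx and docx files are ZIP packages that keep embedded media under ppt/media/ and word/media/ respectively, so plain java.util.zip is enough to pull the images out. The class name OoxmlMediaExtractor and the method extractMedia are made up for this illustration. It does not handle the legacy binary ppt and doc formats.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of any library.
public class OoxmlMediaExtractor {

    /**
     * Copies every entry stored under ppt/media/ or word/media/ into outDir
     * and returns the paths that were written.
     */
    public static List<Path> extractMedia(Path ooxmlFile, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(ooxmlFile))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                boolean isMedia = name.startsWith("ppt/media/") || name.startsWith("word/media/");
                if (entry.isDirectory() || !isMedia) {
                    continue;
                }
                // Keep only the last path segment so nothing is written outside outDir.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                if (fileName.isEmpty()) {
                    continue;
                }
                Path target = outDir.resolve(fileName);
                // Files.copy reads to the end of the current ZIP entry and leaves the stream open.
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                written.add(target);
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OoxmlMediaExtractor <file.pptx|file.docx> <output-dir>");
            return;
        }
        for (Path p : extractMedia(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(p);
        }
    }
}
```

The files come out in whatever format they were embedded in (png, jpeg, emf, and so on), so saving everything as jpg would need a separate conversion step. This approach also says nothing about where an image sits on a slide or how text is formatted. That information lives in the slide XML, which is where a full document API is needed.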