
Convert Microsoft Office documents to Text


I'm looking for a library (or command line tool) to turn MS Office documents into either plaintext or HTML (for conversion to text).

It must run on Linux (not via Wine!).

I found antiword, but the last release was 2005, so it won't read the new Office 2007 formats.

I need it to read Word, Excel and PowerPoint documents.


Solution

  • The Apache POI library can extract text from Office formats. Tika, from the Lucene project, uses POI for this, and Tika can be run as a command line tool:

    curl http://.../document.doc \
      | java -jar tika-app-x.y.jar --text \
      | grep -q keyword
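
The same tool should also work on files already on disk. Below is a minimal sketch of a wrapper, assuming the Tika app jar has been downloaded locally. The function name `office_to_text` and the `TIKA_JAR` variable are hypothetical names chosen for illustration, and the `x.y` version placeholder is left as in the command above.

```shell
#!/bin/sh
# Assumed location of the downloaded Tika app jar; override via the environment.
TIKA_JAR="${TIKA_JAR:-tika-app-x.y.jar}"

# Hypothetical helper: print the plain text of one local Office document.
# $1: path to a Word, Excel or PowerPoint file
office_to_text() {
    if [ ! -f "$1" ]; then
        echo "office_to_text: no such file: $1" >&2
        return 1
    fi
    # --text makes Tika write the extracted plain text to standard output
    java -jar "$TIKA_JAR" --text "$1"
}
```

It could then be used as `office_to_text report.docx > report.txt`, or piped into `grep` as in the answer's own command.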