I'm looking for a library (or command line tool) to turn MS Office documents into either plaintext or HTML (for conversion to text).
It must run on Linux (not via Wine!).
I found antiword, but its last release was in 2005, so it won't read the new Office 2007 formats.
I need it to read Word, Excel and PowerPoint documents.
The Apache POI library can extract text from Office formats, both the older binary ones and the Office 2007 XML ones. Apache Tika, which started out as a Lucene subproject, uses POI for this, and Tika can be run as a command-line tool:
curl http://.../document.doc \
| java -jar tika-app-x.y.jar --text \
| grep -q keyword
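
If the documents are already on disk, the same pipeline can be wrapped in a small loop. This is only a rough sketch: `office_grep` and the `TIKA_JAR` variable are names I made up for illustration, and the jar file name is the same placeholder as above, so substitute whatever version you downloaded.

```shell
#!/bin/sh
# TIKA_JAR is an assumed variable; point it at your tika-app jar.
TIKA_JAR="${TIKA_JAR:-tika-app-x.y.jar}"

# office_grep KEYWORD FILE...
# Hypothetical helper: prints the name of each file whose extracted
# plain text contains KEYWORD.
office_grep() {
    keyword="$1"
    shift
    for f in "$@"; do
        # --text makes Tika emit plain text; the document is fed on stdin,
        # exactly as in the curl pipeline above.
        if java -jar "$TIKA_JAR" --text < "$f" 2>/dev/null \
            | grep -q -- "$keyword"; then
            printf '%s\n' "$f"
        fi
    done
}
```

You would call it as `office_grep keyword *.doc *.xlsx *.pptx`. Note that it starts a new JVM for every file, which is fine for a handful of documents but slow for large batches.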