Whoosh: Indexing MS documents, PDFs

I want to build a document search in Python. Solr was a no-go because Java hosting was a constraint.

So Whoosh seems the obvious option. But it does not seem to index .doc or PDF files natively (as Solr can). How can I make it index these files directly?

Solution

Whoosh just needs the text extracted from those documents. Whoosh won't do that extraction for you, but other tools will: PDFMiner is a Python library for PDFs, and catdoc and antiword are command-line programs for .doc files that you can call from Python.
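
As a rough sketch of how the pieces could fit together: pick an extractor by file extension, then hand the resulting text to a Whoosh writer. This assumes the pdfminer.six package and the antiword binary are installed. The function names, the schema fields (path, content), and the extension table are illustrative choices of mine, not anything Whoosh prescribes.

```python
import os
import subprocess


def extract_pdf(path):
    # Assumes pdfminer.six, which provides pdfminer.high_level.extract_text.
    from pdfminer.high_level import extract_text as pdf_extract_text
    return pdf_extract_text(path)


def extract_doc(path):
    # Assumes the antiword binary is on PATH; it writes plain text to stdout.
    result = subprocess.run(["antiword", path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def extract_plain(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


# Hypothetical mapping; add more extractors (e.g. catdoc) as needed.
EXTRACTORS = {".pdf": extract_pdf, ".doc": extract_doc, ".txt": extract_plain}


def extract_text_from(path):
    """Return the text of a supported file, or None for unknown extensions."""
    extractor = EXTRACTORS.get(os.path.splitext(path)[1].lower())
    if extractor is None:
        return None
    return extractor(path)


def build_index(index_dir, paths):
    """Create a Whoosh index in index_dir from the given file paths."""
    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in

    # Store the path so search hits can point back to the file;
    # the content is indexed but not stored.
    schema = Schema(path=ID(stored=True, unique=True), content=TEXT)
    os.makedirs(index_dir, exist_ok=True)
    ix = create_in(index_dir, schema)
    writer = ix.writer()
    for path in paths:
        text = extract_text_from(path)
        if text:
            writer.add_document(path=path, content=text)
    writer.commit()
    return ix
```

Calling build_index("indexdir", ["report.pdf", "notes.doc"]) (made-up file names) would give you an index you can query with whoosh.qparser.QueryParser on the content field. Note that antiword's output encoding depends on its character mapping settings, so the UTF-8 decoding here is an assumption you may need to adjust.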

See these two discussions for more information:

Best way to extract text from a Word doc without using COM/automation?
How to extract just plain text from .doc & .docx files? (unix)