python whoosh

Whoosh: Indexing MS documents, PDFs


I want to build a document search in Python. Solr was a no-go because Java hosting was a constraint.

So Whoosh seems the obvious option, but it does not seem to index DOC or PDF files natively (as Solr can). How can I make it index these files directly?


Solution

  • Whoosh just needs the extracted text from those documents. The Whoosh library won't do that extraction for you, but other tools will: PDFMiner is a Python library for PDF files, and catdoc and antiword are command-line programs for MS Word files that you can call from Python.
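As a rough illustration of the extract-then-index idea, here is a minimal sketch. It assumes the pdfminer.six package and the antiword executable are installed; the helper names (`pdf_to_text`, `doc_to_text`, `extract`, `build_index`, `search`), the `EXTRACTORS` table, and the two-field schema are made up for this example, not part of Whoosh.

```python
import subprocess
from pathlib import Path


def pdf_to_text(path):
    # Assumption: the pdfminer.six package is installed.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


def doc_to_text(path):
    # Assumption: the antiword executable is on PATH (catdoc could be swapped in).
    result = subprocess.run(
        ["antiword", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical mapping from file suffix to extractor function.
EXTRACTORS = {".pdf": pdf_to_text, ".doc": doc_to_text}


def extract(path):
    """Return the plain text of a supported document, or raise ValueError."""
    path = Path(path)
    try:
        extractor = EXTRACTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no extractor for {path.suffix!r}") from None
    return extractor(path)


def build_index(index_dir, doc_dir):
    """Index every supported file under doc_dir into a new Whoosh index."""
    # Imported here so the extraction helpers work without Whoosh installed.
    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in

    schema = Schema(path=ID(stored=True, unique=True), content=TEXT)
    Path(index_dir).mkdir(parents=True, exist_ok=True)
    ix = create_in(str(index_dir), schema)
    writer = ix.writer()
    for file in sorted(Path(doc_dir).rglob("*")):
        if file.is_file() and file.suffix.lower() in EXTRACTORS:
            writer.add_document(path=str(file), content=extract(file))
    writer.commit()
    return ix


def search(ix, text, limit=10):
    """Return the stored paths of documents matching the query string."""
    from whoosh.qparser import QueryParser

    query = QueryParser("content", ix.schema).parse(text)
    with ix.searcher() as searcher:
        return [hit["path"] for hit in searcher.search(query, limit=limit)]
```

With this in place, something like `ix = build_index("indexdir", "docs")` followed by `search(ix, "invoice")` should return the paths of the matching files. Only the path is stored in the index; the extracted text is indexed but not kept, so the index stays small.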

    See these two discussions for more information: