I am looking for a text indexer for my Python program. I have shortlisted Solr, which is built on Lucene, and Whoosh, which is native to Python. I searched a lot of documentation for support for doc, docx and pdf files, and the Solr documentation kept pointing me to the Tika package, a version of which is integrated with Solr.
The results don't state clearly whether either package has built-in support for these three formats. Do Whoosh and Solr support them? Which other open-source indexers read these formats natively?
With Solr 1.4 or later you can have Word and PDF files uploaded and indexed on the fly; see: http://wiki.apache.org/solr/ExtractingRequestHandler
Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr; Solr then extracts the text from them and indexes it.
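Since you are working in Python, here is a rough sketch of posting a file to that handler using only the standard library. It assumes the handler is mapped to /update/extract (the usual path in the example configuration) and that your schema's unique key field is called id. The base URL, the core name "mycore" and the function names are placeholders I made up, so adjust them to your setup.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_base, doc_id, data, content_type):
    """Build a POST request for Solr's ExtractingRequestHandler.

    Assumes the handler is mapped to /update/extract and that the
    schema's unique key field is named "id".
    """
    # literal.id sets the document's id field; commit=true makes the
    # document searchable right away (convenient for testing, slow in bulk).
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base.rstrip('/')}/update/extract?{params}"
    return Request(
        url,
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_file(solr_base, doc_id, path, content_type="application/pdf"):
    """Send the raw bytes of a file to Solr and return the HTTP status."""
    with open(path, "rb") as fh:
        req = build_extract_request(solr_base, doc_id, fh.read(), content_type)
    with urlopen(req) as resp:
        return resp.status
```

You would call it as index_file("http://localhost:8983/solr/mycore", "doc1", "report.pdf"). For Word files, pass a matching content type such as "application/msword".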