python solr full-text-search whoosh solr-cell

Text indexers (for Python) with built-in support for doc, docx and pdf files

I am looking for a text indexer for my Python program. I have shortlisted Solr, which is built on Lucene, and Whoosh, which is native to Python. I searched the documentation for support for doc, docx and pdf files, and everything about Solr kept pointing me to the Tika package, a version of which is integrated with Solr.

The results don't state clearly whether either package has built-in support for these three formats. Do Whoosh and Solr support them? Which other open-source indexers read these formats natively?

Solution

With Solr 1.4 or later, you can upload Word and PDF files and have them indexed on the fly; see: http://wiki.apache.org/solr/ExtractingRequestHandler

Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr, which then extracts the text from them and indexes it.
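
As a rough illustration of how a Python program might send a file to that handler, here is a minimal sketch using only the standard library. It assumes the handler is mapped at `/update/extract` on your core (the usual location in Solr's example configuration) and that Solr listens on the default port 8983. The core name `mycore`, the document id, the file name and the helper names `build_extract_request` and `index_file` are all made up for the example, so adjust them to your setup.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_core_url, doc_id, file_bytes, content_type):
    """Build a POST request for Solr's ExtractingRequestHandler.

    solr_core_url is assumed to look like http://localhost:8983/solr/mycore
    (hypothetical core name). literal.id sets the document's unique key and
    commit=true makes the document searchable right after the upload.
    """
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_core_url.rstrip('/')}/update/extract?{params}"
    # The raw file is sent as the request body; Tika uses the content type
    # as a hint when choosing a parser.
    return Request(
        url,
        data=file_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_file(solr_core_url, doc_id, path, content_type):
    """Read a local file and upload it to Solr; returns the HTTP status."""
    with open(path, "rb") as fh:
        request = build_extract_request(
            solr_core_url, doc_id, fh.read(), content_type
        )
    with urlopen(request) as response:
        return response.status
```

With a Solr instance running, a call such as `index_file("http://localhost:8983/solr/mycore", "report-1", "report.pdf", "application/pdf")` would upload the PDF. The extracted text is stored in whichever fields your schema and handler configuration map it to.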
