I'm looking for a library for NLP/text processing that presents a common interface for accessing text in the most common document formats:
.doc
and possibly .docx
I want something that ignores nearly everything in the document except the text, while still unifying features such as:
I would still be happy if it supported only two of these formats and only some of the features above.
Googling hasn't been successful, but I'd be surprised if such a thing didn't exist. What do NLP people use for processing large amounts of real-world text? Any platform or programming language is fine, since this seems hard to find. Open source is preferred, so that I can contribute.
(If this is deemed off topic and closed I would at least appreciate a recommendation of what other Stack Exchange site, or what other forum to ask such a question on.)
You might need two steps: first get the content out of the file, then analyze it with an NLP toolkit. Step 1 could be done with Apache Tika. For step 2 the best-known alternatives are probably GATE, Apache UIMA, and OpenNLP. Note that there may be some overlap; for example, UIMA might already have a component that makes use of Tika.
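To make the two-step split concrete, here is a minimal Python sketch using only the standard library. It is not Tika or any of the toolkits named above: `extract_text` and `tokenize` are hypothetical helpers that stand in for step 1 and step 2. The sketch assumes UTF-8 `.txt` files and relies on the fact that a `.docx` file is a ZIP archive whose body text lives in `word/document.xml`; legacy `.doc` is a binary format and is not handled here.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml of a .docx file
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text(path):
    """Step 1 (hypothetical helper): return the plain text of a .txt or .docx file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # Assumption: plain-text files are UTF-8 encoded.
        return path.read_text(encoding="utf-8")
    if suffix == ".docx":
        # A .docx file is a ZIP archive; the body lives in word/document.xml.
        with zipfile.ZipFile(path) as archive:
            root = ET.fromstring(archive.read("word/document.xml"))
        paragraphs = []
        for para in root.iter(W_NS + "p"):
            # Join the text runs (w:t) that make up each paragraph (w:p).
            paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
        return "\n".join(paragraphs)
    raise ValueError(f"unsupported format: {suffix}")


def tokenize(text):
    """Step 2 (hypothetical helper): naive word tokenizer standing in for an NLP toolkit."""
    return re.findall(r"\w+", text.lower())
```

With this shape, the caller only ever sees `tokenize(extract_text(path))`, regardless of the input format. In a real pipeline the first function would be replaced by a proper extractor such as Tika and the second by one of the toolkits above.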