html, nlp, rtf, text-processing, doc

Library that provides plain-text access / iteration across multiple common document formats?


I'm looking for a library for NLP/text processing that presents a common interface for accessing text in the most common document formats:

  • Microsoft Word .doc and possibly .docx
  • RTF
  • HTML
  • "plain text"

I want something that ignores just about everything in the document except the text, while unifying features such as:

  • Inline vs. block formatting (blocks are treated like paragraphs, while inline style changes are ignored)
  • Character encodings, entities, etc.: whatever the input encoding, the text should come out the same (probably UTF-8 or UTF-16)
  • Configurable for various plain-text conventions, such as files meant to be word-wrapped vs. files with hard-coded line breaks
  • Methods to get one character / word / sentence at a time, with the same semantics no matter what the underlying document format is
  • Awareness of ambiguities such as hyphens at the ends of lines, and periods that may be both part of an acronym and the end of a sentence
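
To make the plain-text items concrete, here is a minimal Python sketch of the kind of normalization I mean. The function name `plain_text_blocks` and its `hard_wrapped` flag are hypothetical, not taken from any existing library, and the rule for rejoining hyphenated words is a naive guess that will sometimes be wrong (for example, on genuinely hyphenated compounds).

```python
import re


def plain_text_blocks(text, hard_wrapped=True):
    """Split plain text into paragraph-like blocks (hypothetical helper).

    hard_wrapped=True:  blank lines separate blocks; single newlines are
                        treated as layout only and are joined away.
    hard_wrapped=False: every non-empty line is its own block.
    """
    # Normalize the three common newline conventions first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    if not hard_wrapped:
        return [line.strip() for line in text.split("\n") if line.strip()]

    blocks = []
    for chunk in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in chunk.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-") and line[:1].islower():
                # Assumption: a trailing hyphen followed by a lowercase
                # letter is a line-break artifact, so rejoin the word.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        blocks.append(joined)
    return blocks
```

What I would like from a library is this behaviour, but with the same block semantics for .doc, RTF, and HTML input as well.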

I'd still be happy if it supported only two of these formats and only some of the features above.

Googling hasn't been successful, but I'd be surprised if such things didn't exist. What do NLP people use for processing large amounts of real-world text? Any platform / programming language is OK, since this is hard to find. Open source is best, so that I can contribute.


(If this is deemed off-topic and closed, I would at least appreciate a recommendation of another Stack Exchange site or other forum where I could ask such a question.)


Solution

  • You might need two steps: first get the content out of the file, then analyze it with an NLP toolkit. Step 1 could be done with Apache Tika. For step 2, the best-known alternatives are probably GATE, Apache UIMA, and OpenNLP. Note that there might be some overlap; for example, UIMA might already have a component that makes use of Tika.
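
As a rough illustration of that two-step shape, here is a Python sketch that uses only the standard library. It is a stand-in, not Tika or any of the toolkits named above: `extract_blocks`, `iter_sentences`, the tag sets, and the abbreviation list are all hypothetical names chosen for the example, and it handles only HTML and plain text.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that start or end a block.
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "pre", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}
# Assumption: a toy abbreviation list; real toolkits use trained models.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


class _BlockExtractor(HTMLParser):
    """Collects text per block element; inline tags are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.blocks, self._buf, self._skip = [], [], 0

    def flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def extract_blocks(data, fmt, encoding="utf-8"):
    """Step 1: bytes in, list of Unicode text blocks out."""
    text = data.decode(encoding, errors="replace")
    if fmt == "html":
        parser = _BlockExtractor()
        parser.feed(text)
        parser.close()
        parser.flush()  # keep any trailing text outside a block tag
        return parser.blocks
    if fmt == "text":
        return [" ".join(b.split())
                for b in re.split(r"\n\s*\n", text) if b.strip()]
    raise ValueError("unsupported format: " + fmt)


def iter_sentences(block):
    """Step 2: naive sentence iteration over one block."""
    current = []
    for token in block.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            yield " ".join(current)
            current = []
    if current:
        yield " ".join(current)
```

The point of the split is that step 2 never sees the source format: whatever extractor is used in step 1 only has to hand over a list of Unicode blocks, so a real pipeline could swap the toy extractor for Tika and the toy splitter for one of the toolkits above.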