Tags: java, tesseract, apache-tika

Possible to run two ContentHandlers for a single parse in Apache-Tika?


I'm using Apache Tika to parse documents and generate both a plaintext version and an HTML preview of each one. I can generate both by calling the parse function twice with two separate ContentHandlers, which works great for text-only documents. For documents that require OCR with Tesseract, though, this is extremely wasteful: each parse call runs the OCR (which can take a minute or so), so that work is done twice.

I know I can write my own ContentHandler, but does anyone know of an out-of-the-box solution for this? Much appreciated!


Solution

  • Good news - Apache Tika provides something out of the box for this!

    TeeContentHandler - Content handler proxy that forwards the received SAX events to zero or more underlying content handlers.

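    Just create your two or more real ContentHandlers, pass them to the constructor of TeeContentHandler, then hand the TeeContentHandler to Tika when you do the parse. The document is then parsed (and OCR'd) only once, and each handler receives the same SAX events.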
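
To make the forwarding idea concrete, here is a minimal sketch that uses only the JDK's SAX classes, so it runs without Tika on the classpath. `SimpleTee`, `TextCollector`, `MarkupCollector` and `parseOnce` are hypothetical names invented for this illustration; `SimpleTee` is only a stand-in for Tika's `TeeContentHandler`, and the JDK XML parser stands in for Tika's parser. The comment at the end shows roughly how the same wiring would look with Tika; that part is an untested assumption about a typical setup.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TeeSketch {

    /** Hypothetical stand-in for TeeContentHandler: forwards each event to every delegate. */
    static class SimpleTee extends DefaultHandler {
        private final List<ContentHandler> delegates;

        SimpleTee(ContentHandler... delegates) {
            this.delegates = List.of(delegates);
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (ContentHandler h : delegates) h.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (ContentHandler h : delegates) h.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : delegates) h.characters(ch, start, length);
        }
    }

    /** Keeps character data only, i.e. the plaintext view. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Re-serializes elements and text as markup (no attributes, no escaping: a toy). */
    static class MarkupCollector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Parses the input exactly once and returns {plaintext, markup}. */
    static String[] parseOnce(String xml) throws Exception {
        TextCollector text = new TextCollector();
        MarkupCollector markup = new MarkupCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new SimpleTee(text, markup));
        return new String[] { text.out.toString(), markup.out.toString() };
    }

    public static void main(String[] args) throws Exception {
        String[] result = parseOnce("<p>Hello <b>Tika</b></p>");
        System.out.println(result[0]); // prints: Hello Tika
        System.out.println(result[1]); // prints: <p>Hello <b>Tika</b></p>
    }

    // With Tika on the classpath (assumed, not exercised here), the wiring would be roughly:
    //   ContentHandler text = new BodyContentHandler(-1);   // -1 disables the write limit
    //   ContentHandler html = new ToHTMLContentHandler();
    //   new AutoDetectParser().parse(stream, new TeeContentHandler(text, html),
    //                                new Metadata(), new ParseContext());
    //   // then read text.toString() and html.toString()
}
```

The point of the design is that the expensive step (the parse, and with it any OCR) happens once, while each handler builds its own output from the same event stream.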