
How do I force Tika server to exclude the TesseractOCRParser using curl?


I'm running tika-server-1.23.jar with Tesseract and extracting text from files using curl via PHP. OCR sometimes takes too long, so I'd like to be able to skip Tesseract for some requests. I can do this by inserting

<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>

in the tika config xml file but this means it never runs tesseract.

Can I force the tika server to skip using tesseract selectively at each request via curl and, if so, how?

I've got a workaround where I'm running two instances of the tika server each with a different config file listening on different ports but this is sub-optimal.

Thanks in advance.


Solution

  • For PDF files you can set the OCR strategy with a request header, and one of the available strategies is to skip OCR entirely:

    curl -T test.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: no_ocr"
    

    There isn't really an equivalent for other file types, but there is a similar header prefix called X-Tika-OCR that lets you set configuration on the TesseractOCRConfig instance, and it works for any file type.

    You have some options which could be of interest in your scenario:

    • maxFileSizeToOcr - which you could set to 0
    • timeout - which you could set to the longest time you are willing to wait
    • tesseractPath - which you can set to a path that doesn't exist; if Tika can't find Tesseract there, it can't execute it

    So, for example, if you want to skip OCR for a file you could set the max file size to 0, which means Tesseract will not process it:

    curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRmaxFileSizeToOcr: 0"
    

    Or set the path to /dummy:

    curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRtesseractPath: /dummy"
    

    You can of course use these headers with PDF files too, should you wish.
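
To make the per-request choice explicit, here is a minimal sketch of a shell wrapper around the curl calls above. The function names `ocr_skip_header` and `tika_extract`, the `skip` argument, and the `TIKA_URL` environment variable are all made up for illustration; only the two headers and the server URL come from the commands above. It picks the PDF strategy header for `.pdf` files and the max-file-size header for everything else.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika): prints the header that disables OCR
# for one request, or prints nothing when OCR should run as usual.
#   $1 = file name, $2 = "skip" to suppress OCR for this request
ocr_skip_header() {
    [ "${2:-}" = "skip" ] || return 0
    case "$1" in
        *.pdf|*.PDF) printf '%s\n' "X-Tika-PDFOcrStrategy: no_ocr" ;;
        *)           printf '%s\n' "X-Tika-OCRmaxFileSizeToOcr: 0" ;;
    esac
}

# Hypothetical wrapper: uploads the file to the server and adds the header
# only when the helper produced one. TIKA_URL is an assumed override for the
# default server address.
tika_extract() {
    header=$(ocr_skip_header "$1" "${2:-}")
    if [ -n "$header" ]; then
        curl -sS -T "$1" "${TIKA_URL:-http://localhost:9998/tika}" --header "$header"
    else
        curl -sS -T "$1" "${TIKA_URL:-http://localhost:9998/tika}"
    fi
}
```

With these functions loaded, `tika_extract testOCR.jpg skip` would send the request without OCR, and `tika_extract testOCR.jpg` would leave OCR enabled, so a single server instance can serve both kinds of request. When calling from PHP, the same header string can be passed through `CURLOPT_HTTPHEADER` instead.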