I'm running tika-server-1.23.jar with tesseract and extracting text from files using curl via php. Sometimes it takes too long to run with OCR so I'd like, occasionally, to exclude running tesseract. I can do this by inserting
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
in the tika config xml file but this means it never runs tesseract.
Can I force the tika server to skip using tesseract selectively at each request via curl and, if so, how?
I've got a workaround where I'm running two instances of the tika server each with a different config file listening on different ports but this is sub-optimal.
Thanks in advance.
You can set the OCR strategy using headers for PDF files, which includes an option not to OCR:
curl -T test.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: no_ocr"
There isn't really an equivalent for other file types, but there is a similar header prefix called X-Tika-OCR that allows you to set configuration on the TesseractOCRConfig instance, and it applies to any file type.
You have some options which could be of interest in your scenario:
So, for example, if you want to skip OCR for a file, you could set the max file size for OCR to 0, which means the file will not be processed by Tesseract:
curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRmaxFileSizeToOcr: 0"
Or set the Tesseract path to a dummy location such as /dummy:
curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRtesseractPath: /dummy"
You can of course use these headers with PDF files too, should you wish.
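Since you want to choose per request, here is a minimal sketch of a wrapper that adds the headers above only when asked to. The function name tika_extract, the "skip" argument and the TIKA_URL variable are my own placeholders, not anything Tika provides, and I'm assuming the server is on the default port 9998 as in the examples above. The same logic should carry over to PHP's curl calls by adding the headers to the request only when you want OCR skipped.

```shell
# Hypothetical helper, not part of Tika: TIKA_URL, tika_extract and the
# "skip" argument are illustrative names chosen for this sketch.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

# Usage: tika_extract FILE [skip]
# Without "skip" the file is sent as-is, so the server OCRs as configured.
# With "skip" the request carries the headers that turn OCR off.
tika_extract() {
  # POSIX sh has no local variables, so these two are global.
  file="$1"
  mode="${2:-ocr}"

  # Build the curl argument list in the positional parameters.
  set -- -sS -T "$file"

  if [ "$mode" = "skip" ]; then
    case "$file" in
      # PDFs have a dedicated strategy header.
      *.pdf|*.PDF) set -- "$@" --header "X-Tika-PDFOcrStrategy: no_ocr" ;;
    esac
    # For every file type, a max size of 0 keeps Tesseract from processing it.
    set -- "$@" --header "X-Tika-OCRmaxFileSizeToOcr: 0"
  fi

  curl "$@" "$TIKA_URL"
}
```

With that in place, tika_extract scan.pdf runs with OCR as configured on the server, while tika_extract scan.pdf skip sends the same file with OCR disabled for that one request, so a single server instance should be enough.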