I have installed apache Tika 1.8 and it is running perfectly except the OCR part is not working. I have Tesseract installed and it is also working properly. When I try to send a pdf with an image on it I get the following.
WARNING: Tesseract OCR is installed and will be automatically applied to image f iles unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Can I configure the TikaConfig using the command line utility ? Or do I have to clone the project and update poms and rebuild. I really do not want to have to do that.
There is some info here on how to use the command line utility and the TikaConfig but I cannot figure out how to enable TesseractOCRParser with it.
Any help, greatly appreciated.
OK so with the help of this post on the Apache Tika Forum Thank you guys.
I managed to get it working. Its a hack but It works. What I did was extract the Tika-app Jar file. Then locate the PDFParser.properties and change the following properties like this
extractInlineImages true
extractUniqueInlineImagesOnly false
ocrStrategy ocr_and_text_extraction
Then locate TesseractOCRConfig.properties. And change this one property to 1..
enableImageProcessing=1
Save the above properties files. Zip it all up again. And use your new zipped up jar file and it will now extract text and text from images from a pdf file.