Using Tesseract OCR with Solr 9.1


I had a setup running in which Solr (8.11.2 with Tika 1.27) extracted documents and got OCR output from Tesseract (5.2.0).

To do this, I had updated TesseractOCRConfig.properties inside tika-parsers-1.27.jar with:

tesseractPath=C:/Tesseract-OCR
tessdataPath=C:/Tesseract-OCR/tessdata/
language=dan

I am now trying to replicate the setup with Solr 9.1 (Tika 1.28.4) and the same Tesseract installation. The files are extracted, but I am not getting any OCR.

In 9.1.0 I get the following when extracting a JPG file:

  "x_parsed_by":["org.apache.tika.parser.DefaultParser",
                 "org.apache.tika.parser.jpeg.JpegParser"],

In the 8.11.2 setup I get the following when extracting the same JPG:

    "x_parsed_by":["org.apache.tika.parser.DefaultParser",
                   "org.apache.tika.parser.ocr.TesseractOCRParser",
                   "org.apache.tika.parser.jpeg.JpegParser"],

Solution

  • Turn off the security manager, which is on by default in 9.x. This can be done by setting the environment variable:

    SOLR_SECURITY_MANAGER_ENABLED=false
    

    The issue is that org.apache.tika.parser.ocr.TesseractOCRParser requires execution rights on the folder where Tesseract is installed.

    When determining whether TesseractOCRParser should be loaded, Tika checks whether it can locate and call Tesseract based on the configuration. The method used to check whether it can execute an external parser catches SecurityException, among other exceptions, and just returns false without any logging, so there is no sign that something is misconfigured even if you turn up logging.
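
To illustrate why nothing shows up in the logs, here is a minimal sketch of the pattern described above. It is not Tika's actual code: the class name SilentCheckSketch, the method canRun, and the binary name are hypothetical and only show how a probe that catches SecurityException and returns false makes a blocked executable look the same as a missing one.

```java
import java.io.IOException;

public class SilentCheckSketch {

    // Hypothetical stand-in for an "is the external program usable?" probe.
    // This is an illustration of the pattern, not Tika's implementation.
    static boolean canRun(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so waitFor() cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            process.waitFor();
            return true;
        } catch (SecurityException | IOException e) {
            // A security manager denying execution (SecurityException) and a
            // missing binary (IOException) both end up here. Nothing is
            // logged, so the caller only ever sees "false".
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed name of a binary that does not exist on the PATH.
        System.out.println(canRun("hypothetical-binary-that-does-not-exist", "--version"));
    }
}
```

With a check shaped like this, the caller cannot tell a wrong path from a permission denied by the security manager, which matches the symptom in the question: the OCR parser is simply absent from x_parsed_by.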