Tags: java, ocr, tesseract, apache-tika

Is there a way to disable OCR mode in Tika without uninstalling Tesseract?


I am using the tika-app jar in my project. Is there a way to disable Tesseract OCR in Tika? Two constraints have to be kept:

1. Tesseract cannot be uninstalled.

2. tika.xml can't be edited, as tika-app.jar is used off the shelf.

Is there a way to set the configuration in Java code, by setting a context or parser property, to disable OCR?

I tried the code below, but OCR still extracts text from image files during parsing.

```java
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
context.set(PDFParserConfig.class, pdfConfig);
```

Solution

Exclude TesseractOCRParser from the default parser in a custom Tika config file:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
        <parsers>
            <parser class="org.apache.tika.parser.DefaultParser">
                <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
            </parser>
        </parsers>
    </properties>
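
Since the jar itself can't be modified, one option is to generate this config from Java at runtime and point Tika at the generated file. The sketch below covers only the file-writing half and uses nothing beyond the JDK; the class name `NoOcrTikaConfig`, the method `writeNoOcrConfig`, and the temp-file prefix are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes the parser-exclude config to a temp file
// so that neither tika-app.jar nor the Tesseract install has to change.
public class NoOcrTikaConfig {

    // Same content as the config above: DefaultParser minus TesseractOCRParser.
    static final String NO_OCR_CONFIG = String.join("\n",
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
        "<properties>",
        "  <parsers>",
        "    <parser class=\"org.apache.tika.parser.DefaultParser\">",
        "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>",
        "    </parser>",
        "  </parsers>",
        "</properties>");

    // Writes the config as UTF-8 and returns the path of the new file.
    static Path writeNoOcrConfig() throws IOException {
        Path configFile = Files.createTempFile("tika-no-ocr-", ".xml");
        Files.write(configFile, NO_OCR_CONFIG.getBytes(StandardCharsets.UTF_8));
        return configFile;
    }

    public static void main(String[] args) throws IOException {
        Path configFile = writeNoOcrConfig();
        System.out.println("Wrote Tika config to " + configFile);
    }
}
```

With Tika on the classpath, the returned path can then be loaded with `new TikaConfig(configFile)` and handed to `new AutoDetectParser(tikaConfig)`. When running tika-app from the command line, the same file can be passed with `--config=<file>`. Check both against the Tika version you are using.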