java performance ocr tesseract apache-tika

Slow extraction of scanned PDFs using Apache Tika + Tesseract


I am using Apache Tika to extract text from scanned PDF files, and the extraction itself works perfectly fine. The problem is that it takes too much time and uses a lot of CPU.

In my case, a 15 MB file with 23 pages takes around 4.5 minutes, which is far too long. My working code is below:

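import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

Parser parser = new AutoDetectParser();
// Integer.MAX_VALUE raises the handler's write limit so long documents are not cut off
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
// extract the images embedded in each page so they can be passed to OCR
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// need to add this to make sure recursive parsing happens!
parseContext.set(Parser.class, parser);

Metadata metadata = new Metadata();
parser.parse(inputStream, handler, metadata, parseContext);
String content = handler.toString();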

How can I make this faster? Any suggestions are welcome.


Solution

  • As @Gagravarr mentioned in a comment, the slowness is not caused by Tika: Tesseract OCR is a CPU-intensive process.

    To handle this, I moved the OCR processing to a separate server and queue the files in FIFO order, so that only one file is processed at a time.
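
    The "one file at a time, in FIFO order" idea can be sketched in Java with a single-threaded executor. This is only a minimal in-process illustration, not the setup described above, which ran on a separate server. The class name FifoOcrQueue and the extractor function are hypothetical; the extractor stands in for whatever code performs the Tika parse, and the demo uses a stub so that it runs without Tika or Tesseract installed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs extraction jobs one at a time, in submission order. */
final class FifoOcrQueue implements AutoCloseable {
    // A single worker thread fed by an unbounded FIFO queue, so jobs never overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Assumption: maps a file path to its extracted text (e.g. by calling Tika).
    private final Function<String, String> extractor;

    FifoOcrQueue(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    /** Enqueues a file; the returned Future completes once its turn has been processed. */
    Future<String> submit(String path) {
        Callable<String> job = () -> extractor.apply(path);
        return worker.submit(job);
    }

    /** Stops accepting new jobs; jobs that are already queued still run. */
    @Override
    public void close() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        // Stub extractor used only for this demo.
        try (FifoOcrQueue queue = new FifoOcrQueue(path -> "text of " + path)) {
            List<Future<String>> results = new ArrayList<>();
            for (String path : List.of("a.pdf", "b.pdf", "c.pdf")) {
                results.add(queue.submit(path));
            }
            for (Future<String> result : results) {
                System.out.println(result.get());
            }
        }
    }
}
```

    Because there is only one worker thread, a CPU-heavy OCR job cannot compete with another one for the same cores. The trade-off is throughput: files wait in the queue rather than running in parallel.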