c# .net ocr tesseract

How to speed up Tesseract OCR


I'm trying to OCR a lot of documents (in the range of 300k+ a day). At the moment I'm using the Tesseract wrapper for .NET. The quality is fine, but the speed is not good enough. With 20 tasks running in parallel, each scanning half a page from the same PDF, I get an average of 2.546 seconds per scan. The code I'm using:

using (var engine = new TesseractEngine(Tessdata, "eng", EngineMode.TesseractOnly))
{
    // Dispose the page as soon as its text has been read.
    using (var page = engine.Process(image, srcRect))
    {
        var text = page.GetText();
        return Task.FromResult(text);
    }
}

That average is measured after halving the image resolution and converting it to grayscale. Any ideas to speed up the process? I don't need the text segmented, just the text in one line. Should I maybe use something like Matlab for C#?


Solution

  • Currently, you create a new TesseractEngine object for each page you scan. Creating the engine is costly because it reads the 'tessdata' files.

    You say you have 20 parallel tasks running. Since an engine cannot process multiple pages at once, you will need to create one engine per task and reuse it for all the pages that task processes. You can simply call using (var page = engine.Process(pix)) to process the next page with an existing engine; disposing the page is what frees the engine for the next call.

    Reusing the engine should significantly improve performance because you'll only have to create 20 engines instead of 300k.
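
The one-engine-per-task idea above could be sketched roughly as follows. This is a minimal sketch, not a drop-in replacement for the code in the question: OcrBatch, Run, imagePaths, tessdataPath and workerCount are hypothetical names, and it assumes each scan is an image file on disk that Pix.LoadFromFile can read, whereas the question processes an in-memory image with a source rectangle. Each worker creates its engine once, then pulls paths from a shared queue until the queue is empty.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Tesseract;

static class OcrBatch
{
    // Hypothetical helper: OCR every image with a fixed number of workers,
    // each worker owning exactly one TesseractEngine for its whole lifetime.
    public static ConcurrentDictionary<string, string> Run(
        IEnumerable<string> imagePaths, string tessdataPath, int workerCount)
    {
        var queue = new ConcurrentQueue<string>(imagePaths);
        var results = new ConcurrentDictionary<string, string>();

        var workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            // Created once per worker: this is the costly step that reads tessdata.
            using (var engine = new TesseractEngine(tessdataPath, "eng", EngineMode.TesseractOnly))
            {
                // Keep taking work until the shared queue is empty.
                while (queue.TryDequeue(out var path))
                {
                    using (var pix = Pix.LoadFromFile(path))
                    using (var page = engine.Process(pix))
                    {
                        // The page is disposed at the end of this block,
                        // which lets the same engine process the next image.
                        results[path] = page.GetText();
                    }
                }
            }
        })).ToArray();

        Task.WaitAll(workers);
        return results;
    }
}
```

With workerCount set to 20, this creates 20 engines in total no matter how many images are queued. Whether 20 is the right number depends on the machine; a value near the number of CPU cores is a reasonable starting point to measure from.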