Tags: c#, c++, pinvoke

C# Managed Code optimization


I have a managed C++ DLL that I use in my C# application. The DLL processes a large number of images (thousands) and uses OCR to extract the text from them. I know that OCR processing is CPU-intensive, but I was wondering whether the code can be optimized for better performance.

Currently it takes one minute to parse approximately 15 PNG pages. I would like to get that down to around 30-40 seconds.

The C++ Code:

        char* OCRWrapper::GetUTF8Text(char* path, char* lang, char* imgPath)
        {
            char* imageText;
            tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

            if (api->Init(path, lang)) {
                fprintf(stderr, "Could not initialize tesseract. Incorrect datapath or incorrect language\n"); /*This should throw an error to the caller*/
                exit(1);
            }

            /*Open a reference to the imagepath*/
            Pix *image = pixRead(imgPath);

            /*Read the image object;*/
            api->SetImage(image);

            // Get OCR result
            imageText = api->GetUTF8Text();

            /*writeToFile(outText);*/
            /*printf("OCR output:\n%s", imageText);*/

            /*Release the engine and free the API object; imageText stays valid and is owned by the caller*/
            api->End();
            delete api;

            pixDestroy(&image);
            /*std::string x = std::string(imageText);*/

            return imageText;
        }

This is the C# method that creates an instance of the OCRObject class. OCRObject is the class that actually calls the DLL; it is shown below this method.

    private void GetTextFromSavedImages(List<string> imagesPath)
    {
        try
        {
            StringBuilder allPagesText = new StringBuilder();
            OCRObject ocr = new OCRObject(this.dbHandler.GetApplicationSetting(this.m_ProfileName, "TesseractLanguage").ApplicationSettingValue, this.dbHandler.GetApplicationSetting(this.m_ProfileName, "TesseractConfigurationDataPath").ApplicationSettingValue); //Settings.Default.TesseractConfigurationDataPath
            for (int i = 0; i < imagesPath.Count; i++)
            {

                string pageText = ocr.GetOCRText(imagesPath[i]);
                this.m_pdfDictionary.Add(i + 1, pageText);
                allPagesText.Append(pageText);
            }
            this.AllPageText = allPagesText.ToString();
        }
        catch (Exception ex)
        {
            Logger.Log(ex.ToString(), LogInformationType.Error);
        }
    }

And finally the OCRObject class:

        public class OCRObject
        {
            private string m_tessLanguage;
            private string m_tessConfPath;
            [DllImport(@"\OCR\OCR.dll", EntryPoint = "GetUTF8Text", CallingConvention = CallingConvention.Cdecl)]
            private static extern IntPtr GetUTF8Text(string path, string lang, string imgPath);

            public OCRObject(string language, string tessConfPath)
            {
                if (string.IsNullOrEmpty(language))
                {
                    throw new ArgumentException("Tesseract language is null or empty.");
                }
                if (!System.IO.Directory.Exists(tessConfPath))
                {
                    throw new DirectoryNotFoundException("Could not find directory => " + tessConfPath);
                }    
                this.m_tessLanguage = language;
                this.m_tessConfPath = tessConfPath;
            }    
            public string GetOCRText(string imagePath)
            {
                return this.StringFromNativeUtf8(GetUTF8Text(this.m_tessConfPath, this.m_tessLanguage, imagePath));
            }

            private string StringFromNativeUtf8(IntPtr nativeUtf8)
            {
                try
                {
                    int len = 0;
                    if (nativeUtf8 == IntPtr.Zero)
                    {
                        return string.Empty;
                    }
                    while (Marshal.ReadByte(nativeUtf8, len) != 0) ++len;
                    byte[] buffer = new byte[len];
                    Marshal.Copy(nativeUtf8, buffer, 0, buffer.Length);
                    //GC.Collect(GC.MaxGeneration, GCCollectionMode.Optimized); /*If this help???*/
                    string text = Encoding.UTF8.GetString(buffer);
                    return text;
                }
                catch
                {
                    return string.Empty;
                }
            }
        }

Please let me know if you need more details.


Solution

  • The Tesseract FAQ suggested running its executable in parallel (which implies that it is single-threaded).

    You could try replacing your for loop with Parallel.For and see whether you get a quick and dirty win out of it.

    Edit: They've moved to GitHub, and the new FAQ says:

    "You will get better results having Tesseract produce one page PDF files in parallel, then splicing them together at the end"
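
As a rough illustration of the Parallel.For idea, here is a minimal sketch. The class name ParallelOcrSketch and the ocrPage delegate are hypothetical; the delegate stands in for OCRObject.GetOCRText. It assumes the native call is safe to run concurrently, which seems plausible because each call in the question creates its own TessBaseAPI, but that should be verified. Since Dictionary and StringBuilder are not safe for concurrent writes, each iteration writes only to its own array slot, and the dictionary and combined text are built sequentially afterwards so that page order is preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class ParallelOcrSketch
{
    // ocrPage is a hypothetical stand-in for OCRObject.GetOCRText.
    public static string[] OcrPages(IList<string> imagePaths, Func<string, string> ocrPage)
    {
        var pageTexts = new string[imagePaths.Count];

        // Each iteration writes only to its own slot, so no locking is needed.
        Parallel.For(0, imagePaths.Count, i =>
        {
            pageTexts[i] = ocrPage(imagePaths[i]);
        });

        return pageTexts;
    }

    // Runs on a single thread after the parallel loop has finished,
    // so the dictionary and the StringBuilder are never shared between threads.
    public static string JoinPages(string[] pageTexts, IDictionary<int, string> pageDictionary)
    {
        var allPagesText = new StringBuilder();
        for (int i = 0; i < pageTexts.Length; i++)
        {
            pageDictionary.Add(i + 1, pageTexts[i]); // 1-based page numbers, as in the question
            allPagesText.Append(pageTexts[i]);
        }
        return allPagesText.ToString();
    }
}
```

In GetTextFromSavedImages this might be used as `var texts = ParallelOcrSketch.OcrPages(imagesPath, ocr.GetOCRText);` followed by `this.AllPageText = ParallelOcrSketch.JoinPages(texts, this.m_pdfDictionary);`. Whether it reaches the 30-40 second target depends on how many cores are available and on how much of each call is spent in Init, so it is worth measuring before and after.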