c# c++ encoding tesseract

How to encode Japanese text on a Japanese Windows OS?


I am using Tesseract to read Japanese text. I am getting the text below from OCR.

日付 請求書

C++ code

 extern "C" _declspec(dllexport) char* _cdecl Test(char* imagePath)
    {
        char *outText;

        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        // Initialize tesseract-ocr with the Japanese ("jpn") model from D:\tessdata.
        if (api->Init("D:\\tessdata", "jpn", tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
        {
            fprintf(stderr, "Could not initialize tesseract.\n");           
        }

        api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);      
        outText = api->GetUTF8Text();

        return outText;
    }

C# code

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
        public static extern string Test(string imagePath);

        void Tessrect()
        {
            string result = Test("D:\\japan4.png");
            byte[] bytes = System.Text.Encoding.Default.GetBytes(result);
            MessageBox.Show(System.Text.Encoding.UTF8.GetString(bytes));
        }

Input file: [image]

The above code works fine on English Windows, but it does not work on Japanese Windows, where it gives the wrong output.

Can anyone guide me on how to get it working correctly on Japanese Windows?


Solution

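  • The outText is already in UTF-8 format. Declaring the return type as string makes the marshaler decode those bytes with the system ANSI code page, and the Encoding.Default round trip only undoes that when the code page maps every byte reversibly. On Japanese Windows (code page 932) it typically does not, so the text gets corrupted.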

    outText = api->GetUTF8Text();
    

    Returning a byte[] (or similar) from C++ is a pain, so return the raw pointer instead. Change the declaration to:

    [DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
    public static extern IntPtr Test(string imagePath);
    

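    Then use a StringFromNativeUtf8 helper, which counts the bytes up to the terminating NUL, copies them into a byte[] with Marshal.Copy, and decodes them with Encoding.UTF8.GetString (converting an IntPtr that points to a UTF-8 C string is a pain too: the .NET Framework has nothing built in for it, although newer .NET versions provide Marshal.PtrToStringUTF8):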

    void Tessrect()
    {
        IntPtr result = IntPtr.Zero;
        string result2;
    
        try
        {
            result = Test("D:\\japan4.png");
            result2 = StringFromNativeUtf8(result);
        }
        finally
        {
            Free(result);
        }
    
        MessageBox.Show(result2);
    }
    

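    You also have to free the IntPtr yourself (another pain). The buffer returned by GetUTF8Text must be released with delete[], so export a Free function from the DLL: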

    [DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
    public static extern void Free(IntPtr ptr);
    

    and, on the C++ side:

    extern "C" _declspec(dllexport) void _cdecl Free(char* ptr)
    {
        delete[] ptr;
    }
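
As a native-side sanity check of the answer's first point (that outText already holds UTF-8), the bytes can be validated before they ever cross into C#. The following is a minimal sketch, not part of the original answer. IsValidUtf8 is a hypothetical helper name, and it uses only plain C++, not any Tesseract API:

```cpp
// Returns true if the NUL-terminated buffer is well-formed UTF-8.
// Rejects overlong encodings, surrogates, and code points above U+10FFFF.
bool IsValidUtf8(const char* text)
{
    if (text == nullptr) return false;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
    while (*p != 0)
    {
        int extra;          // number of continuation bytes expected
        unsigned long cp;   // code point being assembled
        unsigned long min;  // smallest code point allowed for this length
        if (*p < 0x80)                { ++p; continue; }  // plain ASCII
        else if ((*p & 0xE0) == 0xC0) { extra = 1; cp = *p & 0x1F; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { extra = 2; cp = *p & 0x0F; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { extra = 3; cp = *p & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        ++p;
        for (int i = 0; i < extra; ++i, ++p)
        {
            // A NUL here means the sequence was cut short, which also fails.
            if ((*p & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;     // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    }
    return true;
}
```

The UTF-8 bytes of the OCR text in the question pass this check, while the same characters in Shift-JIS (code page 932) do not; for example, 日 is the byte pair 0x93 0xFA there. If IsValidUtf8(outText) holds inside Test on both machines, the corruption is most likely happening in the marshaling step rather than in Tesseract.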