I am using Tesseract to read Japanese text. I am getting the text below from OCR.
日付 請求書
C++ code
extern "C" _declspec(dllexport) char* _cdecl Test(char* imagePath)
{
char *outText;
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
// Initialize tesseract-ocr with Japanese ("jpn"), using the tessdata folder passed to Init
if (api->Init("D:\\tessdata", "jpn", tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
{
fprintf(stderr, "Could not initialize tesseract.\n");
}
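api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);
// Load the image with Leptonica and hand it to Tesseract before recognition
Pix *image = pixRead(imagePath);
api->SetImage(image);
outText = api->GetUTF8Text();
pixDestroy(&image);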
return outText;
}
C# code
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern string Test(string imagePath);
void Tessrect()
{
string result = Test("D:\\japan4.png");
byte[] bytes = System.Text.Encoding.Default.GetBytes(result);
MessageBox.Show(System.Text.Encoding.UTF8.GetString(bytes));
}
The above code works fine on English Windows, but not on Japanese Windows, where it gives the wrong output.
Can anyone guide me on how to get the correct output on Japanese Windows?
The outText seems to be already in UTF-8 format:
outText = api->GetUTF8Text();
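Now... declaring the return type as string makes the marshaller decode those UTF-8 bytes with the system ANSI code page, which differs between English and Japanese Windows, so the bytes have to reach C# undecoded. Returning a byte[] (or similar) from C++ is a pain... Change to: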
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern IntPtr Test(string imagePath);
Then take the StringFromNativeUtf8 helper from here (because even converting an IntPtr that points to a UTF-8 C string is a pain... .NET Framework doesn't natively have anything like that, although newer .NET versions provide Marshal.PtrToStringUTF8):
void Tessrect()
{
IntPtr result = IntPtr.Zero;
string result2;
try
{
result = Test("D:\\japan4.png");
result2 = StringFromNativeUtf8(result);
}
finally
{
Free(result);
}
MessageBox.Show(result2);
}
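To make clear what a helper like StringFromNativeUtf8 has to do with the pointer, here is a minimal C++ sketch of the same steps: measure the bytes up to the NUL terminator, then decode them as UTF-8 into UTF-16, which is the encoding a .NET string uses. The name DecodeNativeUtf8 is hypothetical and not part of Tesseract or .NET, and the sketch does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper for illustration: decodes a NUL-terminated UTF-8
// C string into UTF-16. Malformed bytes are replaced with U+FFFD.
std::u16string DecodeNativeUtf8(const char* ptr)
{
    std::u16string out;
    if (ptr == nullptr) {
        return out;
    }
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(ptr);
    const std::size_t len = std::strlen(ptr);  // byte count before the terminator
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = bytes[i];
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(u'\uFFFD'); ++i; continue; }

        // Each continuation byte must look like 10xxxxxx and lie inside the string.
        bool ok = (i + extra < len);
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char next = bytes[i + k];
            if ((next & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = (cp << 6) | (next & 0x3F);
            }
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(u'\uFFFD'); ++i; continue; }

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

Each kanji in the OCR output takes three UTF-8 bytes, so decoding those bytes with a single-byte or Shift-JIS code page cannot give back the original characters.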
Then you'll have to free the IntPtr... another pain.
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern void Free(IntPtr ptr);
and
extern "C" _declspec(dllexport) void _cdecl Free(char* ptr)
{
// GetUTF8Text() returns memory that must be released with delete[]
delete[] ptr;
}
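The reason for exporting a dedicated Free is that the buffer has to be released by the same module, and with the same allocator, that created it. A minimal sketch of that pairing follows. MakeUtf8Copy is a hypothetical stand-in for api->GetUTF8Text(), used only so the example builds without Tesseract, and FreeUtf8 mirrors the exported Free.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for api->GetUTF8Text(): returns a NUL-terminated
// copy allocated with new[], which the caller must hand back for release.
char* MakeUtf8Copy(const char* text)
{
    const std::size_t len = std::strlen(text);
    char* copy = new char[len + 1];
    std::memcpy(copy, text, len + 1);  // copies the terminator as well
    return copy;
}

// Mirrors the exported Free: new[] is paired with delete[] on the native side.
void FreeUtf8(char* ptr)
{
    delete[] ptr;  // delete[] on a null pointer does nothing
}
```

Because delete[] accepts a null pointer, calling Free(result) in the finally block is safe even when Test threw before result was assigned and it is still IntPtr.Zero.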