I have an image containing Chinese characters. I want to find the location of "简体中文", but for some reason, with tesseract::RIL_WORD, the ResultIterator breaks it up like this:
word: "简体"
word: "中"
word: "文"
I don't understand why this happens. I've tried many options and different page segmentation modes, but with no luck. However, when I call GetUTF8Text() on a region with specified coordinates, it returns the correct Chinese text, "简体中文".
How can I get the correct result using the ResultIterator?
Versions:
tesseract 5.0.0
leptonica-1.78.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Full code:
#include <cstdio>
#include <cstdlib>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
    // The phrase I want to locate (not used by the loop below yet)
    const char *pattern = "简体中文";
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    Pix *image = pixRead("chinese_characters.png");
    if (api->Init("/usr/local/share/tessdata/", "chi_sim")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }
    api->SetImage(image);
    api->Recognize(0);
    tesseract::ResultIterator *ri = api->GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
    if (ri != 0) {
        do {
            const char *word = ri->GetUTF8Text(level);
            float conf = ri->Confidence(level);
            int x1, y1, x2, y2;
            ri->BoundingBox(level, &x1, &y1, &x2, &y2);
            printf("word: '%s'; conf: %.2f; BoundingBox: %d,%d,%d,%d;\n", word, conf,
                   x1, y1, x2, y2);
            delete[] word;  // the caller owns the string returned by GetUTF8Text()
        } while (ri->Next(level));
        delete ri;  // the caller also owns the iterator
    }
    // Destroy used object and release memory
    api->End();
    delete api;
    pixDestroy(&image);
    return 0;
}
Full output:
word: '单词'; conf: 94.69; BoundingBox: 170,226,270,275;
word: '“单词'; conf: 55.34; BoundingBox: 390,226,490,275;
word: '单词'; conf: 88.91; BoundingBox: 610,226,710,275;
word: '单词'; conf: 92.26; BoundingBox: 830,226,930,275;
word: '简体'; conf: 96.09; BoundingBox: 95,372,199,421;
word: '中'; conf: 93.13; BoundingBox: 228,372,291,421;
word: '文'; conf: 48.71; BoundingBox: 290,368,348,444;
word: '”单词'; conf: 48.71; BoundingBox: 393,375,493,424;
word: '单词'; conf: 91.40; BoundingBox: 613,375,713,424;
word: '单词'; conf: 86.79; BoundingBox: 833,375,933,424;
word: '单词'; conf: 57.25; BoundingBox: 1053,375,1153,424;
word: '单词'; conf: 94.69; BoundingBox: 174,520,274,569;
word: '“单词'; conf: 55.34; BoundingBox: 394,520,494,569;
word: '单词'; conf: 88.91; BoundingBox: 614,520,714,569;
word: '单词'; conf: 92.26; BoundingBox: 834,520,934,569;
This is actually correct behavior: in Chinese, a single character can be a word by itself, so Tesseract may report some characters as separate words. If you want to match a run of characters regardless of how it was split into words, use tesseract::RIL_SYMBOL instead of tesseract::RIL_WORD. That way you iterate over the symbols one by one.
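As a rough sketch of how symbol-level results could then be matched against a phrase: the snippet below assumes the symbols have already been collected into a vector. The Symbol and Box structs and the FindPhrase function are hypothetical helpers introduced for illustration, not part of the Tesseract API. It looks for a run of consecutive symbols whose text concatenates to the pattern and returns the union of their bounding boxes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper types (not part of Tesseract).
struct Box {
    int x1, y1, x2, y2;
};

// One recognized symbol: its UTF-8 text and its bounding box.
struct Symbol {
    std::string text;
    int x1, y1, x2, y2;
};

// Searches for the first run of consecutive symbols whose concatenated
// text equals `pattern`. On success, stores the union of their bounding
// boxes in `*out` and returns true; otherwise returns false.
bool FindPhrase(const std::vector<Symbol>& symbols,
                const std::string& pattern, Box* out) {
    if (pattern.empty()) return false;
    for (std::size_t start = 0; start < symbols.size(); ++start) {
        if (symbols[start].text.empty()) continue;
        std::string joined;
        Box box = {symbols[start].x1, symbols[start].y1,
                   symbols[start].x2, symbols[start].y2};
        for (std::size_t i = start; i < symbols.size(); ++i) {
            joined += symbols[i].text;
            // Grow the box so that it covers every symbol seen so far.
            box.x1 = std::min(box.x1, symbols[i].x1);
            box.y1 = std::min(box.y1, symbols[i].y1);
            box.x2 = std::max(box.x2, symbols[i].x2);
            box.y2 = std::max(box.y2, symbols[i].y2);
            if (joined == pattern) {
                *out = box;
                return true;
            }
            // Give up on this start position once the joined text can
            // no longer be a proper prefix of the pattern.
            if (joined.size() >= pattern.size() ||
                pattern.compare(0, joined.size(), joined) != 0) {
                break;
            }
        }
    }
    return false;
}
```

To fill the vector, one option is to reuse the loop from the question with level set to tesseract::RIL_SYMBOL, pushing the result of ri->GetUTF8Text(level) and the coordinates from ri->BoundingBox(level, ...) into a Symbol on each iteration (and releasing the returned string with delete[]). Comparing whole UTF-8 strings avoids having to decode individual code points, at the cost of a simple quadratic scan, which should be fine for a single page of text.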