
Why does tesseract::ResultIterator break a Chinese word into separate words?


I have a picture like this: [image: Chinese characters]

I want to find the location of "简体中文", but for some reason, with tesseract::PageIteratorLevel::RIL_WORD, the ResultIterator splits it like this:

word: "简体"
word: "中"
word: "文"

I don't understand why this happens. I've tried many options and different page segmentation modes, but no luck. However, when I call GetUTF8Text() on a region with specified coordinates, it returns the correct Chinese text "简体中文". How can I get the correct result using the ResultIterator?

Versions:

tesseract 5.0.0
 leptonica-1.78.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511

Full code:

#include <cstdio>
#include <cstdlib>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  const char *pattern = "简体中文";  // the text whose location I want to find
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

  Pix *image = pixRead("chinese_characters.png");
  if (image == nullptr) {
    fprintf(stderr, "Could not read the image.\n");
    exit(1);
  }
  if (api->Init("/usr/local/share/tessdata/", "chi_sim")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
  }
  api->SetImage(image);
  api->Recognize(0);
  tesseract::ResultIterator *ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;

  if (ri != nullptr) {
    do {
      const char *word = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("word: '%s';  conf: %.2f; BoundingBox: %d,%d,%d,%d;\n", word, conf,
             x1, y1, x2, y2);
      delete[] word;  // GetUTF8Text() returns a string the caller must delete[]
    } while (ri->Next(level));
    delete ri;  // the iterator is owned by the caller
  }
  // Destroy used object and release memory
  api->End();
  delete api;
  pixDestroy(&image);

  return 0;
}

Full output:

word: '单词';  conf: 94.69; BoundingBox: 170,226,270,275;
word: '“单词';  conf: 55.34; BoundingBox: 390,226,490,275;
word: '单词';  conf: 88.91; BoundingBox: 610,226,710,275;
word: '单词';  conf: 92.26; BoundingBox: 830,226,930,275;
word: '简体';  conf: 96.09; BoundingBox: 95,372,199,421;
word: '中';  conf: 93.13; BoundingBox: 228,372,291,421;
word: '文';  conf: 48.71; BoundingBox: 290,368,348,444;
word: '”单词';  conf: 48.71; BoundingBox: 393,375,493,424;
word: '单词';  conf: 91.40; BoundingBox: 613,375,713,424;
word: '单词';  conf: 86.79; BoundingBox: 833,375,933,424;
word: '单词';  conf: 57.25; BoundingBox: 1053,375,1153,424;
word: '单词';  conf: 94.69; BoundingBox: 174,520,274,569;
word: '“单词';  conf: 55.34; BoundingBox: 394,520,494,569;
word: '单词';  conf: 88.91; BoundingBox: 614,520,714,569;
word: '单词';  conf: 92.26; BoundingBox: 834,520,934,569;

Solution

  • This is actually correct behavior: in Chinese, a single character can be a word on its own, so Tesseract may report it as a separate word. If you want to process the characters together, without the word breaks, use tesseract::RIL_SYMBOL instead of tesseract::RIL_WORD. That lets you iterate over the symbols one by one.
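
Iterating symbol by symbol still leaves the job of putting the pattern back together. One possible way to do that, as a rough sketch rather than anything Tesseract provides: collect the text and bounding box of every unit from the loop in the question (with level set to tesseract::RIL_SYMBOL), then look for a run of consecutive units whose concatenated text equals the pattern and merge their boxes. The Piece and Box types and the findPattern function below are hypothetical helpers introduced only for this illustration; they are not part of the Tesseract API.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Bounding box in image coordinates (left, top, right, bottom).
// Hypothetical helper type, not part of the Tesseract API.
struct Box {
  int x1, y1, x2, y2;
};

// One recognized unit (a symbol or a word) with its bounding box.
// Assumed to be filled from GetUTF8Text(level) and BoundingBox(level, ...).
struct Piece {
  std::string text;  // UTF-8 text of the unit
  Box box;
};

// Searches for a run of consecutive pieces whose concatenated text equals
// `pattern`. On success stores the union of their boxes in `*out` and
// returns true; otherwise returns false and leaves `*out` untouched.
bool findPattern(const std::vector<Piece> &pieces, const std::string &pattern,
                 Box *out) {
  if (pattern.empty()) return false;
  for (std::size_t start = 0; start < pieces.size(); ++start) {
    std::string joined;
    Box merged = pieces[start].box;
    for (std::size_t i = start; i < pieces.size(); ++i) {
      joined += pieces[i].text;
      merged.x1 = std::min(merged.x1, pieces[i].box.x1);
      merged.y1 = std::min(merged.y1, pieces[i].box.y1);
      merged.x2 = std::max(merged.x2, pieces[i].box.x2);
      merged.y2 = std::max(merged.y2, pieces[i].box.y2);
      if (joined == pattern) {
        *out = merged;
        return true;
      }
      // Give up on this start position once the run is too long or is no
      // longer a byte-wise prefix of the pattern.
      if (joined.size() >= pattern.size() ||
          pattern.compare(0, joined.size(), joined) != 0) {
        break;
      }
    }
  }
  return false;
}
```

The comparison is byte-wise on UTF-8 strings, so it does not care whether a unit holds one character or several; the same function would also work on the RIL_WORD results shown in the question. It does not check that the matched units sit on the same text line, so a pattern that happens to span a line break would still match and produce a box covering both lines.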