c++ ocr tesseract

Why do I get such poor results from Tesseract for simple single-character recognition?


I am trying to use Tesseract for single-character recognition and the results are awful: "h" is recognized as "n", "4" as "/i", and "O" as "()".

[Images: h_char, 4_char, O_char]

Single-character mode does not seem to take effect, as many characters are recognized as two characters instead of one. My images are simple bilevel black-and-white TIFF images of Latin characters. They come from a bitmap font, not from scans, so they are absolutely clean and need no preprocessing. Only about half of the characters are recognized correctly, which seems a very low rate for such a simple task.

The Tesseract library version I am using is "4.0.0-beta.3". This is how I call Tesseract.

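 #include <cstdio>
 #include <cstdlib>
 #include <leptonica/allheaders.h>
 #include <tesseract/baseapi.h>

 // Returns the first byte of the recognized text,
 // or -1 if the image cannot be decoded or no text is returned.
 int CharRecognizer::recognizeTIFFData(char* data, int datalength){
             tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();

             if (api->Init(NULL, "eng")) {
                     fprintf(stderr, "Could not initialize tesseract.\n");
                     exit(1);
             }
             api->SetPageSegMode(tesseract::PSM_SINGLE_CHAR);
             // Decode the in-memory TIFF; pixReadMem returns NULL on failure
             Pix *image = pixReadMem(reinterpret_cast<const l_uint8*>(data), datalength);
             int utf8 = -1;
             if (image != NULL) {
                     api->SetImage(image);
                     // Get OCR result
                     char *outText = api->GetUTF8Text();
                     if (outText != NULL) {
                             printf("\nOCR output:\n%s", outText);
                             // Only the first byte of the UTF-8 result is returned
                             utf8 = static_cast<unsigned char>(outText[0]);
                             delete[] outText;
                     }
                     pixDestroy(&image);
             }
             // Release Tesseract resources and the API object itself
             api->End();
             delete api;
             return utf8;
 }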

I am new to Tesseract, so I am probably missing something. Do I have to train the library first? Maybe I should set another OcrEngineMode? I expected no problems recognizing a simple bitmap font and am quite at a loss now. Thank you very much in advance, Yuliana


Solution

  • I was able to make tesseract produce correct results in your case by adding a 1x1 pixel border around your images. I tested this using the tesseract command line tool on Linux:

    $ tesseract R2a51.png stdout --psm 10
    n
    $ convert R2a51.png -border 1x1 R2a51.border.png
    $ tesseract R2a51.border.png stdout --psm 10
    h
    

    The convert tool is used to create a version of the image with the border.

    It seems that tesseract cannot handle characters bordering on the image edge correctly (at least with default settings).

    N.B. Your third character is still recognized as 0 (zero), not O (the letter), but I am not sure this can be considered an OCR error. You might want to look into Tesseract character whitelists to deal with that.

    Edit: It also seems that the Tesseract legacy engine works on your images without modification. It can be invoked on the command line via --oem 0. Beware that you need a matching *.traineddata file for your language in your tessdata directory, and it must include the legacy model. A suitable variant can be downloaded from https://github.com/tesseract-ocr/tessdata
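
The border fix above can also be done in code rather than with convert. The following is only a minimal sketch of the padding step in plain C++, independent of any imaging library. The Bitmap struct and the addBorder function are hypothetical names introduced for illustration and are not part of Tesseract or Leptonica. It assumes a bilevel image stored one byte per pixel, with 0 as white background and 1 as black ink.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal bilevel bitmap: 0 = white background, 1 = black ink.
struct Bitmap {
    std::size_t width = 0;
    std::size_t height = 0;
    std::vector<std::uint8_t> pixels;  // row-major, width * height entries

    std::uint8_t at(std::size_t x, std::size_t y) const {
        return pixels[y * width + x];
    }
};

// Returns a copy of `src` surrounded by `border` background pixels on every side.
Bitmap addBorder(const Bitmap& src, std::size_t border, std::uint8_t background = 0) {
    // The pixel buffer must match the declared dimensions.
    assert(src.pixels.size() == src.width * src.height);

    Bitmap dst;
    dst.width = src.width + 2 * border;
    dst.height = src.height + 2 * border;
    // Start with an all-background canvas, then copy the glyph into the middle.
    dst.pixels.assign(dst.width * dst.height, background);

    for (std::size_t y = 0; y < src.height; ++y) {
        for (std::size_t x = 0; x < src.width; ++x) {
            dst.pixels[(y + border) * dst.width + (x + border)] = src.at(x, y);
        }
    }
    return dst;
}
```

Calling addBorder(glyph, 1) mirrors the 1x1 border used on the command line. In the question's code, which already holds a Leptonica Pix, the library's own pixAddBorder applied to the image before SetImage would likely play the same role; that variant was not tested here.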