c++ opencv tesseract

Tesseract very low detection quality


I am trying to read some data with Tesseract, but it already struggles with the date and time, so I created a minimal test case.

code:

#include <string>
#include <sstream>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <opencv2/opencv.hpp>
#include <opencv2/imgproc.hpp>
#include <boost/algorithm/string/trim.hpp>
using namespace std;
using namespace cv;

int main(int argc, const char * argv[]) {

    string outText, imPath = argv[1];
    cv::Mat image_final = cv::imread(imPath, CV_8UC1);

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);
    api->SetPageSegMode(tesseract::PSM_AUTO_ONLY);
    cv::adaptiveThreshold(image_final,image_final,255,ADAPTIVE_THRESH_MEAN_C, cv::THRESH_BINARY,11,2);

    api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step);
    api->SetVariable("tessedit_char_whitelist", "0123456789- :");
    outText = string(api->GetUTF8Text());
    api->End();

    std::istringstream iss(outText);

    for (std::string line; std::getline(iss, line); ) {
        boost::algorithm::trim(line);
        if (!line.empty()) cout << line << endl;
    }

    cv::imwrite("out.png", image_final);

    return 0;
}

test image

output:

1122-03-08 18:10
2122-030 18:10

I even tried whitelisting these characters (which will not be the case in the final version), but I am still getting very bad results.


Solution

  • It looks like the main issue is setting bytes_per_pixel to 3 instead of 1 in api->SetImage.

    The image after cv::adaptiveThreshold has 1 color channel (1 byte per pixel), not 3.

    Replace api->SetImage(image_final.data, image_final.cols, image_final.rows, 3, image_final.step); with:

    api->SetImage(image_final.data, image_final.cols, image_final.rows, 1, image_final.step);
    

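    Also replace cv::imread(imPath, CV_8UC1) with cv::imread(imPath, cv::IMREAD_GRAYSCALE). CV_8UC1 is a matrix type constant, not an imread flag; it only works here because both constants happen to equal 0.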


    You may also try replacing tesseract::PSM_AUTO_ONLY with tesseract::PSM_AUTO or tesseract::PSM_SINGLE_BLOCK.

    According to the comment in the header file:

    PSM_AUTO_ONLY = 2, ///< Automatic page segmentation, but no OSD, or OCR.

    (Unless this is on purpose; I have never used the C++ interface.)


    I tried to reproduce the problem using pytesseract in Python, but I get an error when setting PSM to 2.
    I am probably also using a different version of Tesseract.

    With PSM 3 the result is perfect, and it should be perfect with the image from your post as well.

    Python code:

    import cv2
    from pytesseract import pytesseract
    
    # Tesseract path
    pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
    
    img = cv2.imread("out.png", cv2.IMREAD_GRAYSCALE)  # Read input image as Grayscale
      
    # The language is passed as a keyword argument rather than inside the config string
    text = pytesseract.image_to_string(img, lang="eng",
                                       config="--psm 3 "
                                              "-c tessedit_char_whitelist=' '0123456789-:")
    
    print(text)
    

    Output:
    2022-03-08 18:19:15