ocr tesseract

How to improve Tesseract results when the text has a background image


I am trying to extract numbers from images. I tested Tesseract OCR, but the result is not good enough. For example,

tesseract test.jpg stdout --psm 6


will output:

4367 42424W0 104

I assume the problem is the background image behind the text. Is there any way to improve the result?


Solution

  • You can use ImageMagick's convert command to threshold the image to black and white. ImageMagick is available for download and supports multiple platforms.

    Run:

    convert image.jpg -threshold 33% thresholded.jpg
    

    This produces a thresholded, black-and-white version of the image. The threshold value of 33% was found after a few attempts and adjustments.

    With the thresholded image, the basic tesseract command gives the correct output.

    If the image contains only the digits 0-9, you can restrict Tesseract to those characters to improve recognition accuracy: -c tessedit_char_whitelist=0123456789

    Hope this helps.
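
Since the threshold was found by trial and error, that search can be scripted. The following is only a rough sketch, not part of the original answer: the helper name `ocr_with_threshold`, the candidate values, and the `thresholded_NN.png` output names are all assumptions made for illustration. It also assumes `convert` and `tesseract` are on your PATH (newer ImageMagick releases may prefer the `magick` command), and whether the digit whitelist takes effect can depend on your Tesseract version and engine.

```shell
#!/bin/sh
# Sketch: threshold an image at several levels and print what Tesseract reads.
# Assumes ImageMagick's `convert` and `tesseract` are installed and on PATH.

# Hypothetical helper: binarize one image at a given threshold (in percent)
# and OCR the result as a single block of digits.
ocr_with_threshold() {
    input=$1
    percent=$2
    # Output name is an assumption; PNG avoids adding JPEG artifacts.
    output="thresholded_${percent}.png"
    convert "$input" -threshold "${percent}%" "$output" || return 1
    tesseract "$output" stdout --psm 6 -c tessedit_char_whitelist=0123456789
}

# Try a few candidate thresholds on the image given as the first argument.
# The candidate list is arbitrary; adjust it for your own images.
if [ "$#" -ge 1 ]; then
    for percent in 25 30 33 40 50; do
        printf '%s%%: ' "$percent"
        ocr_with_threshold "$1" "$percent"
    done
fi
```

Saved under a name of your choice (say `sweep.sh`, a made-up name), it could be run as `sh sweep.sh test.jpg`. Each line of output shows a threshold and the text recognized at that level, so you can pick the value that reads cleanly.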