How to improve Tesseract results when the text has a background image

I am trying to extract numbers from images. I tested Tesseract OCR, but the results are not good enough. For example,

tesseract test.jpg stdout --psm 6

will output:

4367 42424W0 104

I assume the problem is caused by the background image behind the text. Is there any way to improve the result?

Solution

You can use ImageMagick's convert command to threshold the image to black and white. You can download ImageMagick here; it supports multiple platforms.

By typing,

convert image.jpg -threshold 33% thresholded.jpg

It outputs the image below. The threshold value of 33% was found after a few attempts and adjustments.

After that, the basic tesseract command gives the correct output.

If the image contains only the digits 0-9, you can restrict Tesseract to those characters to improve recognition accuracy by adding the option -c tessedit_char_whitelist=0123456789.
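Since the threshold has to be found by trial and error, it may help to script the search. The following is a minimal sketch, not a tested recipe for your image: the function name sweep_thresholds and the output file names are made up for illustration, and writing PNG instead of JPEG is my own choice to avoid compression artifacts in the black-and-white image. It assumes convert and tesseract are on your PATH (on ImageMagick 7 the command may be magick instead of convert).

```shell
# Hypothetical helper: threshold IMAGE at each given percentage and
# print what tesseract reads from each result.
# Usage: sweep_thresholds IMAGE PCT [PCT...]
sweep_thresholds() {
    input=$1
    shift
    for pct in "$@"; do
        # One output file per threshold so the images can be inspected later.
        out="thresholded_${pct}.png"
        convert "$input" -threshold "${pct}%" "$out" || return 1
        # Label each OCR result with the threshold that produced it.
        printf '%s%%: ' "$pct"
        tesseract "$out" stdout --psm 6 -c tessedit_char_whitelist=0123456789
    done
}
```

For example, sweep_thresholds image.jpg 25 30 33 40 45 prints one labelled result per threshold, so you can pick the value whose output matches the numbers in the image.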

Hope this helps.