I am trying to extract numbers from images. I tested Tesseract OCR, but the result is not good enough. For example,
tesseract test.jpg stdout --psm 6
will output:
4367 42424W0 104
I assume the issue is caused by the background image behind the characters. Is there any way to improve the result?
You can use the convert command of ImageMagick to threshold the image to black and white (in ImageMagick 7 the same operation is available through the magick command). You can download ImageMagick here; it supports multiple platforms.
Run:
convert image.jpg -threshold 33% thresholded.jpg
This produces the image below. The threshold value of 33% was found after a few attempts and adjustments, so a different image will likely need a different value.
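To make the effect of -threshold concrete, here is a minimal Python sketch of the same idea on a tiny grayscale image: every pixel brighter than the cutoff becomes white and everything else becomes black. The function name threshold and the sample pixel values are made up for illustration; this is not ImageMagick's implementation, only the rule it applies.

```python
def threshold(pixels, percent=33.0, max_value=255):
    """Return a black-and-white copy of a grayscale image.

    pixels is a list of rows, each row a list of ints in 0..max_value.
    Values strictly above the cutoff become max_value (white),
    all other values become 0 (black).
    """
    cutoff = max_value * percent / 100.0  # 33% of 255 is 84.15
    return [[max_value if p > cutoff else 0 for p in row] for row in pixels]


# Made-up sample: dim values (60-80) and bright values (210-220).
sample = [
    [60, 210, 70],
    [80, 220, 75],
]
print(threshold(sample))
```

Mid-tone background detail that falls on the same side of the cutoff as the true background is flattened away, which is why the digits stand out more cleanly afterwards. If the background survives, moving the percentage up or down is the adjustment described above.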
Then the basic tesseract command, run on the thresholded image, gives the correct output.
If the image contains only the digits 0-9, you can also restrict Tesseract to those characters to improve recognition accuracy: -c tessedit_char_whitelist=0123456789
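If you want to script this step, the following Python sketch builds the full tesseract command line with the digit whitelist and runs it. The helper names tesseract_digits_command and ocr_digits and the file name thresholded.jpg are assumptions for illustration; the flags themselves (stdout, --psm 6, -c tessedit_char_whitelist=...) are the ones used in this thread. Whether the whitelist takes effect may depend on your Tesseract version and OCR engine.

```python
import shutil
import subprocess


def tesseract_digits_command(image_path, psm=6, whitelist="0123456789"):
    """Build the argument list for a digits-only tesseract run (hypothetical helper)."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def ocr_digits(image_path):
    """Run tesseract on image_path and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        tesseract_digits_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Print the command only; call ocr_digits("thresholded.jpg") to actually run it.
    print(" ".join(tesseract_digits_command("thresholded.jpg")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.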
Hope this helps.