
How to process my images to help Tesseract?


I have some images containing only digits and a colon.

Example: https://i.imgur.com/o4iz80V.png

You can see more here: https://imgur.com/a/54dsl6h

They seem pretty clean and straightforward to me, but Tesseract treats them as empty pages (it reports "Empty page!!").

I tried both --oem 1 and --oem 0, with a character whitelist:

  • tesseract processed/35.0.png stdout -c tessedit_char_whitelist=0123456789: --oem 0

  • tesseract processed/35.0.png stdout

What can I do to get Tesseract to recognize the characters better?


Solution

  • Tesseract still gives me pretty bad results overall, but making the text bolder with a simple dilation algorithm helped a bit.

    In the end, since the font is really square, I used a trick: I defined a set of line segments for each digit, and depending on which segments intersect or don't intersect the digit, I can determine with 99% accuracy which digit it is.
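Here is a rough sketch of both ideas in plain Python. It is not my exact code. It assumes each glyph is already binarized (1 = ink, 0 = background), cut out into a fixed-size cell, and shaped roughly like a seven-segment display. The function names, the probe coordinates and the seven-segment layout are all assumptions made for illustration, so they would need tuning for a real font. Dilation is what I would apply before handing the image to Tesseract; the segment test works on the undilated glyph.

```python
# Sketch only: images are lists of rows, 1 = ink, 0 = background.
# Names, probe positions and the seven-segment layout are assumptions.

def dilate(img, radius=1):
    """Thicken strokes: every ink pixel also inks its neighbours within `radius`."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out

# One short probe line per stroke, in cell-relative coordinates
# (x0, y0, x1, y1). Each probe crosses the place where a stroke would be.
PROBES = {
    "a": (0.5, 0.0, 0.5, 0.2),    # top bar
    "b": (0.8, 0.25, 1.0, 0.25),  # upper right
    "c": (0.8, 0.75, 1.0, 0.75),  # lower right
    "d": (0.5, 0.8, 0.5, 1.0),    # bottom bar
    "e": (0.0, 0.75, 0.2, 0.75),  # lower left
    "f": (0.0, 0.25, 0.2, 0.25),  # upper left
    "g": (0.5, 0.4, 0.5, 0.6),    # middle bar
}

# Which probes are expected to hit ink for each digit.
PATTERNS = {
    frozenset("abcdef"): 0, frozenset("bc"): 1, frozenset("abdeg"): 2,
    frozenset("abcdg"): 3, frozenset("bcfg"): 4, frozenset("acdfg"): 5,
    frozenset("acdefg"): 6, frozenset("abc"): 7, frozenset("abcdefg"): 8,
    frozenset("abcdfg"): 9,
}

def probe_hits(img, probe, steps=8):
    """Return True if any sampled point along the probe line lands on ink."""
    h, w = len(img), len(img[0])
    x0, y0, x1, y1 = probe
    for i in range(steps + 1):
        t = i / steps
        x = min(w - 1, int((x0 + t * (x1 - x0)) * w))
        y = min(h - 1, int((y0 + t * (y1 - y0)) * h))
        if img[y][x]:
            return True
    return False

def classify(img):
    """Map the set of probes that hit ink to a digit, or None if unknown."""
    hits = frozenset(name for name, p in PROBES.items() if probe_hits(img, p))
    return PATTERNS.get(hits)
```

To use it, split the image into one cell per character and call classify() on each cell. A None result means the probes did not match any known digit, which is a useful signal that the probe positions need adjusting or that the cell holds the colon rather than a digit.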