Search code examples
pythonopencvocrtesseract

Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned


I am trying to detect these price labels text which is always clearly preprocessed. Although it can easily read the text written above it, it fails to detect price values. I am using python bindings pytesseract although it also fails to read from the CLI commands. Most of the time it tries to recognize the part where the price as one or two characters.

Sample 1:

tesseract D:\tesseract\tesseract_test_images\test.png output

And the output of the sample image is this.

je Beutel

13

However if I crop and stretch the price to look like they are seperated and are the same font size, output is just fine.

Processed image(cropped and shrinked price):

je Beutel

1,89

How do get OCR tesseract to work as I intended, as I will be going over a lot of similar images? Edit: Added more price tags:
sample2sample3sample4sample5 sample6 sample7


Solution

  • The problem is that the Tesseract engine was not trained to read this kind of text topology.

    You can:

    • train your own model, and you'll need in particular to provide images with variations of topology (position of characters). You can actually use the same image, and shuffle the positions of the characters.
    • reorganize the image into clusters of text and use tesseract, in particular, I would consider the cents part and move it on the right of the coma, in that case you can use tesseract out of the box. Few relevant criterions would be the height of the clusters (to differenciate cents and integers), and the position of the clusters (read from the left to the right).

    In general computer vision algorithms (including CNNs) are giving you tool to have a higher representation of an image (features or descriptors), but they fail to create a logic or an algorithm to process intermediate results in a certain way.

    In your case that would be:

    • "if the height of those letters are smaller, it's cents",
    • "if the height, and vertical position is the same, it's about the same number, either on left of coma, or on the right of coma".

    The thing is that it's difficult to reach that through training, and at the same time it's extremely simple to write this for a human as an algorithm. Sorry for not giving you an actual implementation, but my text is the pseudo code.

    TrainingTesseract2

    TrainingTesseract4

    Joint Unsupervised Learning of Deep Representations and Image Clusters