python opencv tesseract

Tesseract images that are nearly identical parse differently


I am opening an image, applying a morphological transformation, and saving it. However, there is no visible difference between the images (even if you zoom in to the pixels). Image links are below. One of them parses correctly and the other parses incorrectly.

Here's the kicker. If I open the image that isn't parsing correctly in MS Paint, do absolutely nothing, and then click save, it will magically start parsing correctly.

Can anyone provide an explanation for this?

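Here is my code:

import cv2
import numpy as np
# IMAGE is the path to the input file (defined elsewhere)
img = cv2.imread(IMAGE, 1)  # 1 = cv2.IMREAD_COLOR
imgray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kernel = np.ones((1, 40), np.uint8)  # 1x40 horizontal structuring element
morphed = cv2.morphologyEx(imgray, cv2.MORPH_CLOSE, kernel)
dst = cv2.add(imgray, 255 - morphed)  # cv2.add saturates at 255 instead of wrapping
cv2.imwrite("out.png", dst)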

Image correctly parsed as "52.983.842.":

[image]

Image incorrectly parsed as "522.983.8422.":

[image]


Solution

  • The two images do differ.

    If you load both into GIMP as layers and set the upper layer's mode to Subtract, you get this:

    [image: difference]

    After the last 2, the difference seems to contain some artifact, which Tesseract thinks is another digit.

    Saving the file from Paint probably re-encodes it, which would change the pixel data enough that the artifact no longer trips up Tesseract.

    Consider that your pictures are JPG, which is a lossy format. An encoder can choose among several quantization tables, and you get different artifacts depending on the choice. In this case, it seems Tesseract picked up the noise.

    Also note that JPG and text don't go well together. Consider using a lossless format, such as PNG.