python opencv ocr tesseract python-tesseract

Pytesseract doesn't recognize decimal points

I'm trying to read the text in this image, which also contains decimal numbers:

This is the code I'm using:

img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))

and what I get is:

73-82
Primo: 50 —

I also tried specifying the Italian language, but the result is much the same:

73-82 _
Primo: 50

Searching other questions on Stack Overflow, I found that reading decimal numbers can be improved with a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any idea how to improve the recognition of decimal numbers?

Solution

I would suggest passing each row of text to Tesseract as a separate image.
For some reason, this seems to solve the decimal point issue...

Convert the image from grayscale to black and white using cv2.threshold.
Apply the cv2.dilate morphological operation with a very long horizontal kernel (this merges the characters of each row into one block).
Find contours with cv2.findContours - each merged row ends up in a separate contour.
Compute the bounding box of each contour.
Sort the bounding boxes by their y coordinate.
Iterate over the bounding boxes and pass the corresponding slices to pytesseract.

Here is the code:

import numpy as np
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # I am using Windows

path_to_image = 'image.png'

img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE)  # Read input image as Grayscale

# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate thresh with a wide horizontal kernel (3 rows x 100 columns) to merge the text of each row into one block.
dilated_thresh = cv2.dilate(thresh, np.ones((3, 100), np.uint8))


# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]  # Use index [-2] to be compatible with both OpenCV 3 and 4

# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]

# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])


# Iterate bounding boxes
for b in bounding_boxes:
    x, y, w, h = b

    if (h > 10) and (w > 10):
        # Crop a slice with a 10-pixel margin, and invert black and white (Tesseract prefers black text).
        row_img = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]

        # Whitelist letters, digits and the punctuation used in the image; --psm 3 is automatic page segmentation.
        text = pytesseract.image_to_string(row_img, config="-c tessedit_char_whitelist="
                                                            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                                                            " --psm 3")

        print(text)

I know it's not the most general solution, but it works for the sample you posted.
Please treat this answer as a conceptual solution - finding a robust solution might be very challenging.
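
For images where the text rows are perfectly horizontal and do not overlap vertically, a simpler way to split the rows is a horizontal projection profile: mark every image row that contains at least one text pixel, then group consecutive marked rows. This is a different technique from the dilate + contours approach above, shown only as a sketch. The function name split_rows and the min_height parameter are made up for illustration, and the synthetic image stands in for the thresh array.

```python
import numpy as np

def split_rows(binary, min_height=3):
    """Return a list of (top, bottom) image-row ranges that contain text.

    `binary` is a 2-D array in which text pixels are non-zero, for example
    the inverted Otsu threshold image. `bottom` is exclusive.
    """
    has_ink = (binary > 0).any(axis=1)            # True for each image row with text pixels
    # Pad with False on both sides so every run has a start and an end.
    padded = np.concatenate(([False], has_ink, [False]))
    changes = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(changes == 1)         # first image row of each run
    ends = np.flatnonzero(changes == -1)          # one past the last image row of each run
    # Drop runs that are too thin to be text (specks of noise).
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_height]

# Synthetic stand-in for a thresholded image: two "text rows" and one noise pixel.
demo = np.zeros((40, 60), np.uint8)
demo[5:12, 10:50] = 255
demo[20:30, 5:40] = 255
demo[35, 3] = 255

rows = split_rows(demo)
print(rows)
```

Each (top, bottom) pair could then be cropped as thresh[top:bottom, :], padded and inverted as in the code above, and passed to pytesseract. Unlike the contour approach, this ignores the horizontal extent of the text and breaks down on skewed scans.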

Results:

Thresholded image after dilation:

First slice:

Second slice:

Third slice:

Output text:

7.3-8.2

Primo:50
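
As a side note, the whitelist in the config string above does not have to be typed by hand; it can be assembled from Python's string module. A minimal sketch, assuming the same characters as in my code. The helper name build_tesseract_config is hypothetical and not part of pytesseract.

```python
import string

def build_tesseract_config(extra_chars="-:.", psm=3):
    """Build a pytesseract config string with a character whitelist.

    Hypothetical helper: whitelists ASCII letters, digits and `extra_chars`,
    and sets the page segmentation mode with --psm.
    """
    whitelist = string.ascii_letters + string.digits + extra_chars
    return f"-c tessedit_char_whitelist={whitelist} --psm {psm}"

config = build_tesseract_config()
print(config)
```

The result can be passed as pytesseract.image_to_string(row_img, config=config). Accented letters are not part of string.ascii_letters, so Italian text that uses them would need those characters added through extra_chars.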