
Pytesseract doesn't recognize decimal points


I'm trying to read the text in this image, which also contains decimal numbers with decimal points:

[image]

in this way:

import cv2
import pytesseract
img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))

and what I get is:

73-82
Primo: 50 —

I've also tried specifying the Italian language, but the result is much the same:

73-82 _
Primo: 50

Searching through other questions on Stack Overflow, I found that the reading of decimal numbers can be improved by using a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any idea how to improve the reading of decimal numbers?


Solution

  • I would suggest passing each row of text to Tesseract as a separate image.
    For some reason, this seems to solve the decimal point issue.

    • Convert the image from grayscale to black and white using cv2.threshold.
    • Apply the cv2.dilate morphological operation with a very wide horizontal kernel, so that the characters of each row merge into one block.
    • Find contours with cv2.findContours; each merged row ends up in a separate contour.
    • Find the bounding boxes of the contours.
    • Sort the bounding boxes by their y coordinate.
    • Iterate over the bounding boxes, and pass each slice to pytesseract.
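
The last three steps are plain coordinate arithmetic, so they can be sketched without OpenCV. The helper below is only an illustration: the name row_crops and its pad and min_size parameters are hypothetical, and the boxes are assumed to be (x, y, w, h) tuples like the ones cv2.boundingRect returns.

```python
def row_crops(boxes, image_shape, pad=10, min_size=10):
    """Return (y0, y1, x0, x1) crop bounds for each text row, top to bottom.

    boxes: iterable of (x, y, w, h) tuples (the layout cv2.boundingRect uses).
    image_shape: (height, width) of the image that will be sliced.
    """
    height, width = image_shape[:2]
    crops = []
    # Sort by the top edge (y) so the rows come out in reading order.
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        # Skip tiny boxes, which are most likely noise rather than text.
        if w <= min_size or h <= min_size:
            continue
        # Pad the box on every side, clamping to the image borders.
        crops.append((max(y - pad, 0), min(y + h + pad, height),
                      max(x - pad, 0), min(x + w + pad, width)))
    return crops


# Hypothetical boxes: two text rows and one small speck of noise.
boxes = [(5, 120, 200, 30), (3, 2, 4, 4), (8, 40, 150, 28)]
print(row_crops(boxes, (160, 210)))
# [(30, 78, 0, 168), (110, 160, 0, 210)]
```

Each tuple can then be used to slice the thresholded image as image[y0:y1, x0:x1] before handing the slice to pytesseract.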

    Here is the code:

    import numpy as np
    import cv2
    import pytesseract
    
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # I am using Windows
    
    path_to_image = 'image.png'
    
    img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE)  # Read input image as Grayscale
    
    # Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
    ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    
    # Dilate thresh for uniting text areas into blocks of rows.
    dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))
    
    
    # Find contours on dilated_thresh.
    # Index [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]
    
    # Build a list of bounding boxes
    bounding_boxes = [cv2.boundingRect(c) for c in cnts]
    
    # Sort bounding boxes from "top to bottom"
    bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])
    
    
    # Iterate bounding boxes
    for b in bounding_boxes:
        x, y, w, h = b
    
        if (h > 10) and (w > 10):
            # Crop a padded slice (clamped to the image borders) and invert it,
            # because Tesseract prefers black text on a white background.
            row_img = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]
    
            # Whitelist letters, digits, and the punctuation expected in the image.
            config = ("-c tessedit_char_whitelist="
                      "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                      " --psm 3")
            text = pytesseract.image_to_string(row_img, config=config)
    
            print(text)
    

    I know it's not the most general solution, but it manages to solve the sample you have posted.
    Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.
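
The long whitelist in the config string can also be assembled from Python's string constants, which makes it easier to adjust the allowed punctuation for other images. This is a minimal sketch: the function name whitelist_config and its parameters are hypothetical, and only the tessedit_char_whitelist variable and the --psm option come from the code above.

```python
import string


def whitelist_config(extra_chars="-:.", psm=3):
    """Build a Tesseract config string that allows letters, digits and extra_chars."""
    # ASCII letters (lower and upper case) plus digits, plus the extra punctuation.
    allowed = string.ascii_letters + string.digits + extra_chars
    return f"-c tessedit_char_whitelist={allowed} --psm {psm}"


config = whitelist_config()
print(config)
```

The result can be passed as config=config to pytesseract.image_to_string. The whitelist should not contain spaces in this form, because a space would end the -c value; how strictly the whitelist is honored may also depend on the Tesseract version, so it is worth checking on your own installation.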


    Results:

    [Image: thresholded image after dilation]

    [Image: first slice]

    [Image: second slice]

    [Image: third slice]

    Output text:

    7.3-8.2

    Primo:50