I'm trying to read the text in this image that contains also decimal points and decimal numbers
in this way:
img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))
and what I get is:
73-82
Primo: 50 —
I've tried to specify also the italian language but the result is pretty similar:
73-82 _
Primo: 50
Searching through other questions on stackoverflow I found that the reading of the decimal numbers can be improved by using a whitelist, in this case tessedit_char_whitelist='0123456789.'
, but I want to read also the words in the image. Any idea on how to improve the reading of decimal numbers?
I would suggest passing tesseract every row of text as separate image.
For some reason it seams to solve the decimal point issue...
cv2.threshold
.cv2.dilate
morphological operation with very long horizontal kernel (merge blocks across horizontal direction).pytesseract
.Here is the code:
import numpy as np
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # I am using Windows
path_to_image = 'image.png'
img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE) # Read input image as Grayscale
# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))
# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2] # Use index [-2] to be compatible to OpenCV 3 and 4
# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]
# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])
# Iterate bounding boxes
for b in bounding_boxes:
x, y, w, h = b
if (h > 10) and (w > 10):
# Crop a slice, and inverse black and white (tesseract prefers black text).
slice = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]
text = pytesseract.image_to_string(slice, config="-c tessedit"
"_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
" --psm 3"
" ")
print(text)
I know it's not the most general solution, but it manages to solve the sample you have posted.
Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.
Results:
Thresholder image after dilate:
Output text:
7.3-8.2
Primo:50