python-3.x python-tesseract text-extraction

Python pytesseract extract number from various images

I have various types of images like these:

As you can see, they are all kinda similar, but I cannot manage to properly extract the number from them.

So far my code consists of the following:

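    import re

    import cv2
    import numpy as np
    from pytesseract import image_to_string

    # img: the input image (e.g. loaded with cv2.imread)
    lower = np.array([250,200,90], dtype="uint8")
    upper = np.array([255,204,99], dtype="uint8")

    # Keep only the pixels whose color falls inside [lower, upper]
    mask = cv2.inRange(img, lower, upper)
    res = cv2.bitwise_and(img, img, mask=mask)

    data = image_to_string(res, lang="eng", config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789')
    numbers = int(''.join(re.findall(r'\d+', data)))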

I tried tweaking the psm parameter (6, 8 and 13). Each value works for some of these examples, but none works for all of them, and I have no idea how to get around this problem.

Another solution proposed is:

    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    (h, w) = gry.shape[:2]
    gry = cv2.resize(gry, (w*2, h*2))
    erd = cv2.erode(gry, None, iterations=1)
    thr = cv2.threshold(erd, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    bnt = cv2.bitwise_not(thr)

However, on the first picture, bnt gives:

And then pytesseract sees 460..

Any ideas, please?

Solution

My approach:

Upsampling is required for accurate recognition. Resizing the image to twice its original width and height makes the digits readable.

Erosion is a morphological operation that eats away at the boundaries of bright regions. Here it removes the strokes on the digits, which makes them easier to detect.

Thresholding (binary or inverse binary, with Otsu's method choosing the threshold value) turns the image into pure black and white, which helps to reveal the features.

Bitwise-not is a bitwise operation that inverts every pixel value; it is highly useful for extracting part of the image.

You can learn more methods simply by reading Improving the quality of the output.

(Images: the result after erosion, threshold, and bitwise-not.)

Update

The first image is easy to read because it needs no resizing, erosion, or thresholding. Please read How to Improve Quality of Tesseract.

Result:

Code:

import cv2
import pytesseract

img_lst = ["oqWjd.png", "YZDt1.png", "MUShJ.png", "kbK4m.png", "POIK2.png", "4W3R4.png"]

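for i, img_nm in enumerate(img_lst):
    img = cv2.imread(img_nm)
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    (h, w) = gry.shape[:2]
    if i == 0:
        # The first image is used as-is (grayscale only).
        thr = gry
    else:
        # Upsample by 2x, then erode with the default 3x3 kernel.
        gry = cv2.resize(gry, (w * 2, h * 2))
        erd = cv2.erode(gry, None, iterations=1)
        if i == len(img_lst)-1:
            # The last image needs inverse-binary thresholding.
            thr = cv2.threshold(erd, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
        else:
            thr = cv2.threshold(erd, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    # Invert the image, then run OCR with the digits config
    # (--psm 6 assumes a single uniform block of text).
    bnt = cv2.bitwise_not(thr)
    txt = pytesseract.image_to_string(bnt, config="--psm 6 digits")
    # Keep only alphanumeric characters (drops commas, whitespace, newlines).
    print("".join([t for t in txt if t.isalnum()]))
    cv2.imshow("bnt", bnt)
    cv2.waitKey(0)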

If you want to display the comma in the result, change the print("".join([t for t in txt if t.isalnum()])) line to print(txt).
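If you would rather keep the comma but still drop whitespace and stray characters, a small helper along these lines might do it. `clean_ocr_text` is a hypothetical name, and the sample string is made up to stand in for real OCR output.

```python
def clean_ocr_text(txt: str, keep_comma: bool = False) -> str:
    """Keep alphanumeric characters, and optionally commas, from OCR output."""
    allowed = "," if keep_comma else ""
    return "".join(ch for ch in txt if ch.isalnum() or ch in allowed)


# Made-up sample standing in for the string returned by image_to_string.
sample = "1,234\n"

print(clean_ocr_text(sample))                   # 1234
print(clean_ocr_text(sample, keep_comma=True))  # 1,234
```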

Note that for the last image (4W3R4.png) the threshold method changes from binary to inverse binary. Binary thresholding does not work accurately on all images, so you need to switch methods for that image.