Tags: python, opencv, image-processing, ocr, python-tesseract

How to improve the OCR accuracy in this image?


I want to extract text from a picture in Python, using OpenCV for preprocessing and pytesseract for OCR. I have an image like this:

[Image: Input]

I have written some code to extract the text from that picture, but it is not accurate enough to extract the text properly.

This is my code:

import cv2
import pytesseract
    
img = cv2.imread('photo.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_,img = cv2.threshold(img,110,255,cv2.THRESH_BINARY)

custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(img, config=custom_config)
print(text)

cv2.imshow('pic', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

I have also tried cv2.adaptiveThreshold, but it does not work as well as cv2.threshold.

And, finally, this is my result, which does not match the text in the picture:

Color Yellow RBC/hpf 4-6
Appereance Semi Turbid WBC/hpf 2-3
Specific Gravity 1014 Epithelial cells/Lpf 1-2
PH 7 Bacteria (Few)
Protein Pos(+) Casts Negative
Glucose Negative Mucous (Few)
Keton Negative
Blood Pos(+)
Bilirubin Negative
Urobilinogen Negative
Nigitesse 5 ed eg ative

Is there any way to improve the accuracy?


Solution

  • I was actually quite surprised how good the result already is, given the noticeable skew. However, the actual problem with the last line is not the skew but the shadow! This is your thresholded image:

    [Image: Thresholded]

    So, pytesseract has no chance of detecting anything meaningful in the last line. Let's try to remove the shadow, following Dan Mašek's answer (linked in the code comment below), and let Otsu do the thresholding:

    import cv2
    import numpy as np
    import pytesseract
    
    # Read input image, convert to grayscale
    img = cv2.imread('NiVUK.jpg')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
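    # Remove shadows, cf. https://stackoverflow.com/a/44752405/11089932
    # Dilate to wipe out the dark text, then median-blur to estimate the background
    dilated_img = cv2.dilate(gray, np.ones((7, 7), np.uint8))
    bg_img = cv2.medianBlur(dilated_img, 21)
    # Subtract the background: paper becomes white, text stays dark
    diff_img = 255 - cv2.absdiff(gray, bg_img)
    # Stretch the result to the full 0..255 range
    norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255,
                             norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)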
    
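    # Threshold using Otsu's method; the threshold argument (0) is ignored
    # because THRESH_OTSU picks the value automatically
    work_img = cv2.threshold(norm_img, 0, 255,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]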
    
    # Tesseract
    custom_config = r'--oem 3 --psm 6'
    text = pytesseract.image_to_string(work_img, config=custom_config)
    print(text)
    
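To make the Otsu step less of a black box: Otsu's method picks the threshold that maximizes the between-class variance of the grayscale histogram, so no hand-tuned value such as 110 is needed. The following is a minimal NumPy sketch of that idea, not OpenCV's implementation. The helper name `otsu_threshold` is hypothetical, and tie-breaking between equally good thresholds may differ from what `cv2.threshold` returns.

```python
import numpy as np


def otsu_threshold(gray):
    """Return a threshold t for an 8-bit image; pixels > t form the bright class.

    Illustrative sketch only (hypothetical helper, not part of OpenCV).
    """
    # Normalized histogram of the 256 possible gray levels
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256, dtype=np.float64)

    w0 = np.cumsum(prob)                 # weight of the dark class (values <= t)
    w1 = 1.0 - w0                        # weight of the bright class (values > t)
    cum_mean = np.cumsum(prob * levels)  # cumulative mean up to each level t
    total_mean = cum_mean[-1]

    # Between-class variance for every candidate t; skip empty classes
    between = np.zeros(256)
    valid = (w0 > 0) & (w1 > 0)
    between[valid] = ((total_mean * w0[valid] - cum_mean[valid]) ** 2
                      / (w0[valid] * w1[valid]))
    return int(np.argmax(between))


# Tiny two-level example: the chosen threshold separates the two gray values
demo = np.full((10, 10), 50, dtype=np.uint8)
demo[:, 5:] = 200
t = otsu_threshold(demo)
binary = (demo > t).astype(np.uint8) * 255
```

On the real image, calling `cv2.threshold` with `cv2.THRESH_OTSU` as in the code above remains the practical choice; the sketch only illustrates what that flag computes.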

    The shadow-removed, thresholded image looks like this:

    [Image: Shadow-removed, thresholded]

    And the final output looks correct to me:

    Color Yellow RBC/hpf 4-6
    Appereance Semi Turbid WBC/hpf 2-3
    Specific Gravity 1014 Epithelial cells/Lpf 1-2
    PH 7 Bacteria (Few)
    Protein Pos(+) Casts Negative
    Glucose Negative Mucous (Few)
    Keton Negative
    Blood Pos(+)
    Bilirubin Negative
    Urobilinogen Negative
    Nitrite Negative
    
    ----------------------------------------
    System information
    ----------------------------------------
    Platform:      Windows-10-10.0.16299-SP0
    Python:        3.9.1
    PyCharm:       2021.1.1
    NumPy:         1.20.2
    OpenCV:        4.5.1
    pytesseract:   4.00.00alpha
    ----------------------------------------