Tags: python, tesseract, python-tesseract

Tesseract doesn't recognize individual text segments after whitelisting


I have an image that I want to extract text from using Tesseract and Python. I only want to recognize a certain set of characters, so I pass tessedit_char_whitelist=1234567890CBDE as a config option. However, Tesseract no longer seems to recognize the gaps between the lines. Is there a character I can add to the whitelist so that it recognizes the text as individual segments again?

Here is the image after the whitelist:

[image: after whitelist]

Here is the image before the whitelist:

[image: before whitelist]

Here is the code responsible for recognizing the characters and drawing the boxes, in case you're wondering:


import cv2
import pytesseract
from pytesseract import Output

# Configure parameters for Tesseract
# whitelist = "-c tessedit_char_whitelist=1234567890CBDE"
custom_config = r'--oem 3 --psm 6'
# Feed the image to Tesseract (threshold_img is the preprocessed image, created earlier)
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
print(details.keys())

# Minimum confidence for drawing a box; 0 keeps every recognized word
# and skips only the structural entries that Tesseract reports with conf -1
CONFIDENCE = 0
total_boxes = len(details['text'])
for sequence_number in range(total_boxes):
    # conf may be a number or a string such as '96.5', so convert with float()
    if float(details['conf'][sequence_number]) >= CONFIDENCE:
        (x, y, w, h) = (details['left'][sequence_number], details['top'][sequence_number], details['width'][sequence_number], details['height'][sequence_number])
        threshold_img = cv2.rectangle(threshold_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# Display the image
cv2.imshow('captured text', threshold_img)
cv2.imwrite("before.png", threshold_img)
# Keep the output window open until the user presses a key
cv2.waitKey(0)
# Destroy all open windows
cv2.destroyAllWindows()

EDIT:

Here is the original image I want to extract the text from with the goal to write it to a matrix:

[image: original image]

The desired matrix would take the following form:


content = [
    ["1C", "55", "55", "E9", "BD"],
    # ...
    ["1C", "1C", "55", "BD", "BD"]
]

Solution

  • One solution is:

      1. Take each tuple individually and upsample it by a factor of 2
      2. Apply an inverse binary threshold
      3. Recognize the text with the page segmentation mode set to 6

    Recognized result for each row (the cropped tuple images and their thresholded versions are not reproduced here):

    Row 1: 1C 55 55 E9 BO
    Row 2: 1C 1C 55 BO 1C
    Row 3: 1C 55 BO 55 IC
    Row 4: 1C BD 50 1C 1C
    Row 5: 1C 1C 55 BD BD

    The idea is to take each tuple separately, upsample it, and then apply an inverse binary threshold. Tesseract misread a few tuples because of the font: for instance, the character D looks like O, so BD was recognized as BO in some cells. If you need 100% accuracy, I suggest training Tesseract on this font. Also, make sure you try other page segmentation modes.
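To make the "try other page segmentation modes" suggestion concrete, here is a minimal sketch that builds a few candidate config strings. `build_config` and `HEX_CHARS` are hypothetical names introduced only for illustration. Combining the modes with the whitelist from the question is an assumption, not something the answer tested. Modes 7 (single text line) and 8 (single word) are picked only as plausible candidates for small crops.

```python
def build_config(psm, whitelist=None):
    """Return a Tesseract config string for one page segmentation mode.

    Hypothetical helper: it only assembles the command-line options.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        # Restrict recognition to the given characters
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


HEX_CHARS = "1234567890CBDE"  # whitelist taken from the question

# psm 6: single uniform block, psm 7: single text line, psm 8: single word
candidate_configs = [build_config(psm, HEX_CHARS) for psm in (6, 7, 8)]

for cfg in candidate_configs:
    print(cfg)
    # With a thresholded crop `thr` available, each config could be tried as:
    # txt = pytesseract.image_to_string(thr, config=cfg)
```

Each string can be passed as the `config` argument of `pytesseract.image_to_string`. Which mode works best for this font would have to be checked against the real crops.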

    Here is the array output:

    [['1C', '55', '55', 'E9', 'BO'], ['1C', '1C', '55', 'BO', '1C'], ['1C', '55', 'BO', '55', 'IC'], ['1C', 'BD', '50', '1C', '1C'], ['1C', '1C', '55', 'BD', 'BD']]
    

    Code:


    import cv2
    import pytesseract

    img = cv2.imread("IVemF.png")
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    (h, w) = gry.shape[:2]
    cell_h = h // 5  # height of one grid cell
    cell_w = w // 5  # width of one grid cell
    cfg = "--psm 6"  # treat each crop as a single uniform block of text
    res = []

    for r in range(5):
        row = []
        for c in range(5):
            # Crop one tuple out of the 5x5 grid
            crp = gry[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
            # Upsample the crop by a factor of 2
            (h_crp, w_crp) = crp.shape[:2]
            crp = cv2.resize(crp, (w_crp * 2, h_crp * 2))
            # Inverse binary threshold, with the level chosen by Otsu's method
            thr = cv2.threshold(crp, 0, 255,
                                cv2.THRESH_BINARY_INV |
                                cv2.THRESH_OTSU)[1]
            txt = pytesseract.image_to_string(thr, config=cfg)
            # Drop the trailing newline and form feed that Tesseract appends
            txt = txt.strip().upper()
            row.append(txt)
            print(txt)
            cv2.imshow("thr", thr)
            cv2.waitKey(0)  # press a key to move on to the next tuple
        res.append(row)

    print(res)
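
If retraining is not an option, a cheap post-processing pass over `res` may recover some of the misread cells. The sketch below is an assumption and not part of the approach above. The `LOOKALIKES` table is a guess based on the D/O confusion mentioned earlier plus an assumed I/1 confusion, and the names `fix_cell`, `fix_matrix` and `suspicious_cells` are hypothetical.

```python
# Assumed look-alike substitutions for this font (a guess, adjust as needed)
LOOKALIKES = {"O": "D", "I": "1"}
# Characters expected in the image, taken from the whitelist in the question
ALLOWED = set("1234567890CBDE")


def fix_cell(cell):
    """Replace known look-alike characters in one recognized cell."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in cell)


def fix_matrix(matrix):
    """Apply fix_cell to every cell of a list-of-lists matrix."""
    return [[fix_cell(cell) for cell in row] for row in matrix]


def suspicious_cells(matrix):
    """Return (row, col) positions that contain characters outside ALLOWED."""
    return [(r, c)
            for r, row in enumerate(matrix)
            for c, cell in enumerate(row)
            if not set(cell) <= ALLOWED]


# Array output reported above
res = [['1C', '55', '55', 'E9', 'BO'],
       ['1C', '1C', '55', 'BO', '1C'],
       ['1C', '55', 'BO', '55', 'IC'],
       ['1C', 'BD', '50', '1C', '1C'],
       ['1C', '1C', '55', 'BD', 'BD']]

print(suspicious_cells(res))  # cells that need attention before the fix
fixed = fix_matrix(res)
print(fixed)
```

This only handles characters that fall outside the expected alphabet. A misread that stays inside the alphabet cannot be detected this way, so training Tesseract remains the more reliable route.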