python, ocr, tesseract, python-tesseract

How do I get PyTesseract OCR to recognise characters when there is a line on top of them?


So I'm trying to write a program that reads numbers off a graph (and then does some further processing, but that's irrelevant here). I've got the following code, which mostly works.

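import cv2
import numpy as np
import pytesseract

# resource_path, img_path and PATH_TO_TESSERACT are defined elsewhere
img_resource_path = resource_path(img_path)
# With `import pytesseract`, the executable path lives on the inner module
pytesseract.pytesseract.tesseract_cmd = PATH_TO_TESSERACT
img = cv2.imread(img_resource_path, 0)  # 0 = load as grayscale
kernel = np.ones((1, 1), np.uint8)  # note: a 1x1 kernel leaves the image unchanged
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
# Inverted Otsu threshold flipped back; same result as THRESH_BINARY + THRESH_OTSU
thresh = 255 - cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
x, y, w, h = 0, 615, 680, 200  # region of interest
ROI = thresh[y:y+h, x:x+w]
text = pytesseract.image_to_string(ROI, lang='eng')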

The issue is that there is a black line on top of some of the rows of the table, which makes Tesseract misread the characters. It should output -90.58dB, but it outputs -950.58dB.

How do I make it ignore the black line on top? (A picture of what it looks like is attached.)

Edit: Can I create my own training data and use that? A few online sources, including the Tesseract docs, say that retraining is unlikely to help. Any opinions?

Data


Solution

  • Try EasyOCR instead. It is a separate OCR library that uses its own deep-learning models rather than the Tesseract engine. Install it with pip install easyocr

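    import easyocr

    # Load the English model once; gpu=False runs it on the CPU
    reader = easyocr.Reader(['en'], gpu=False)
    # Each detection is a (bounding box, text, confidence) tuple
    result = reader.readtext('3.png')
    for detection in result:
        print(detection)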
    

    Output:

    ([[8, 8], [458, 8], [458, 88], [8, 88]], '-90.58 dB', 0.7896991366270714)
    ([[7, 84], [460, 84], [460, 168], [7, 168]], '-90.58 dB', 0.7465829283820857)
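
If you need the readings as numbers rather than strings, the tuples above can be post-processed with plain Python. This is only a rough sketch: the `results` list is copied from the output above so the snippet runs without EasyOCR installed, and `extract_db_values`, `DB_PATTERN` and the `min_confidence` cut-off of 0.5 are names and values I made up for illustration, not part of EasyOCR.

```python
import re

# Detections copied from the output above: (bounding box, text, confidence).
results = [
    ([[8, 8], [458, 8], [458, 88], [8, 88]], '-90.58 dB', 0.7896991366270714),
    ([[7, 84], [460, 84], [460, 168], [7, 168]], '-90.58 dB', 0.7465829283820857),
]

# Hypothetical pattern: optional sign, digits, optional decimals, then "dB".
DB_PATTERN = re.compile(r'([-+]?\d+(?:\.\d+)?)\s*dB', re.IGNORECASE)


def extract_db_values(detections, min_confidence=0.5):
    """Return dB readings as floats, skipping low-confidence detections."""
    values = []
    for _bbox, text, confidence in detections:
        if confidence < min_confidence:
            continue  # assumed cut-off; tune it for your images
        match = DB_PATTERN.search(text)
        if match:
            values.append(float(match.group(1)))
    return values


print(extract_db_values(results))  # prints [-90.58, -90.58]
```

In a real run you would pass the list returned by reader.readtext(...) instead of the hard-coded `results`. Filtering on the confidence score is a cheap way to drop detections the model was unsure about, though the right threshold depends on your images.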