python opencv tesseract

How to solve the new line problem in Tesseract OCR?


I have an image with text in it. I used OCR to scan the image and the text was recognized correctly. There is just one problem: where the text wraps onto a new line, the output has no space between the last word of one line and the first word of the next.

import cv2
import pytesseract

img = cv2.imread('cropped.png')
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
result = pytesseract.image_to_string(img, lang='eng', config='--psm 6')
ret_str = ""
# Keep only letters, digits, and spaces; every other character is dropped
for letter in result:
    if letter.isalnum() or letter == " ":
        ret_str += letter.lower()
c_list = ret_str.strip()
print(c_list)

Output:

['gundam builddivers']

As you can see, there is no space between "build" and "divers" in the first element.

Image:


Solution

  • import cv2
    import pytesseract

    img = cv2.imread('cropped.png')
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    result = pytesseract.image_to_string(img, lang='eng', config='--psm 6')
    # Turn line breaks into spaces before filtering, so words on adjacent lines stay separated
    result = result.replace("\n", " ")
    ret_str = ""
    for letter in result:
        if letter.isalnum() or letter == " ":
            ret_str += letter.lower()
    c_list = ret_str.strip()
    print(c_list)
    

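    Adding result.replace("\n", " ") before the filtering loop is the fix. The loop discards every character that is not alphanumeric or a space, and that includes the newline, so the last word of one line gets joined to the first word of the next. Replacing each newline with a space first keeps the words apart.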
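
The same cleanup can be tried without Tesseract by running it on a plain string. The sketch below is only an illustration: normalize_ocr_text is a hypothetical helper name, not part of pytesseract, and the sample string is an assumed stand-in for what image_to_string might return. It treats any whitespace character as a space, not just "\n", and collapses repeated spaces, which the .replace() version does not do.

```python
def normalize_ocr_text(text: str) -> str:
    """Lowercase OCR output, keeping only letters, digits, and single spaces.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    # Map every whitespace character (newline, tab, form feed, ...) to a space
    spaced = "".join(" " if ch.isspace() else ch for ch in text)
    # Same filter as the loop above: keep alphanumerics and spaces only
    kept = "".join(ch.lower() for ch in spaced if ch.isalnum() or ch == " ")
    # split() + join collapses runs of spaces and trims both ends
    return " ".join(kept.split())


# Assumed sample: two lines of text, as OCR output might look
sample = "Gundam Build\nDivers\n"
print(normalize_ocr_text(sample))  # gundam build divers
```

With real OCR output, the call would be normalize_ocr_text(result), where result is the string returned by pytesseract.image_to_string.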