python, tesseract, python-tesseract

Tesseract: problems with upper-case character


I'm using Tesseract with Python. I have an image containing one to six words and need to read the text. Sometimes the character "C", which looks the same in upper and lower case, is detected as lower-case "c" instead of upper-case "C". I understand why this happens, but given the letters that follow, it should be possible to infer the correct case. Is there a configuration option or some other way to improve this? I'm using this code:

import argparse

import cv2
import pytesseract

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
if image is None:
    # cv2.imread returns None instead of raising when the file is unreadable
    raise SystemExit("Could not read image: " + args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# apply OCR directly to the grayscale array; pytesseract accepts NumPy
# arrays, so the temporary PNG file is not needed
text = pytesseract.image_to_string(gray)
print("Output: " + text)

I tried config='-psm x' with different values for x, but none of the page segmentation modes solved my problem.


Solution

  • Updating from Tesseract 3 to Tesseract 5 fixed the problem
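
Since the fix depends on which Tesseract version is installed, it can help to check that before debugging further. The following is a minimal sketch, not part of the original answer: the helper names `tesseract_major`, `psm_config` and `installed_tesseract_major` are hypothetical, and it assumes that Tesseract 4 and later expect the flag spelled `--psm` while the 3.x releases documented `-psm` (the spelling used in the question).

```python
import re


def tesseract_major(version_text: str) -> int:
    """Extract the major version from text such as 'tesseract 5.3.0'."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def psm_config(major: int, psm: int) -> str:
    """Build the page segmentation flag for the given major version.

    Assumption: 4.x and later use '--psm', 3.x uses '-psm'.
    """
    flag = "--psm" if major >= 4 else "-psm"
    return f"{flag} {psm}"


def installed_tesseract_major() -> int:
    """Ask pytesseract which Tesseract binary it is actually calling."""
    # Imported lazily so the two helpers above work without pytesseract.
    import pytesseract

    return tesseract_major(str(pytesseract.get_tesseract_version()))
```

With Tesseract installed, a call might look like `pytesseract.image_to_string(gray, config=psm_config(installed_tesseract_major(), 7))`, where mode 7 treats the image as a single text line. If the reported major version is 3, upgrading as described above is likely the more effective step.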