I'm using Tesseract to detect a word written in black on a white background. In some images an info symbol appears after the word. I'm not interested in detecting this symbol, only the word. Sometimes the info symbol (with a circle around it) is detected as 0 or O, which is fine. But in other cases (probably when Tesseract doesn't know how to handle this sign) it just returns an empty string, so the word is not returned either. I used the code given here and also tried the configuration suggested here:
from PIL import Image
import pytesseract
import argparse
import cv2
import os
import numpy as np
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
    help="type of preprocessing to be done")
args = vars(ap.parse_args())
# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# check to see if we should apply thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
    gray = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
    gray = cv2.medianBlur(gray, 3)
# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
# apply OCR directly to the in-memory image (the temporary file is
# never read back), and then delete the temporary file
text = pytesseract.image_to_string(gray, config='--psm 7')
os.remove(filename)
print("Output: " + text)
If anyone has an idea what else I could try, I'd be very grateful!
Solved: the config has to be config='-psm 7', with only one "-".
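Whether one or two dashes are needed seems to depend on the installed Tesseract version. Below is a minimal sketch of how the flag could be picked, assuming (as far as I can tell) that Tesseract 3.x accepts `-psm` while 4.x and later expect `--psm`. The helper name `psm_config` and its parameters are made up for illustration and are not part of pytesseract.

```python
def psm_config(major_version: int, psm: int = 7) -> str:
    """Build the page-segmentation config string for pytesseract.

    Hypothetical helper. Assumption: Tesseract 3.x takes the
    single-dash form, Tesseract 4.x and later take the double-dash form.
    """
    flag = "-psm" if major_version < 4 else "--psm"
    return "{} {}".format(flag, psm)


# Usage with the script above, with the major version entered by hand
# (check `tesseract --version` on your machine):
#   text = pytesseract.image_to_string(gray, config=psm_config(3))
print(psm_config(3))
print(psm_config(4))
```

The major version is passed in explicitly here so the helper stays independent of how the version is looked up.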