I have used the following PyTorch implementation of EAST (Efficient and Accurate Scene Text Detector) to identify and draw bounding boxes around text in a number of images, and it works very well!
However, the next step of OCR, which I am attempting with pytesseract in order to extract the text from these images and convert it to strings, is failing horribly. Using all possible configurations of --oem and --psm, I am unable to get pytesseract to detect what appears to be very clear text, for example:
The recognized text is below the images. Even though I have applied contrast enhancement, and also tried dilating and eroding, I cannot get tesseract to recognize the text. This is just one example of many images where the text is even larger and clearer. Any suggestions on transformations, configs, or other libraries would be helpful!
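For reference, my dilation/erosion attempt looked roughly like this (a minimal sketch, not a working solution; the kernel size and iteration counts are my guesses):

import cv2
import numpy as np

# sketch of the dilate/erode preprocessing I tried; kernel size is a guess
gray = cv2.imread('./images/fesa.jpg', 0)
kernel = np.ones((2, 2), np.uint8)
dilated = cv2.dilate(gray, kernel, iterations=1)
eroded = cv2.erode(gray, kernel, iterations=1)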
UPDATE: After trying Gaussian blur + Otsu thresholding, I am able to get black text on a white background (which is apparently ideal for pytesseract), and I also added the Spanish language, but it still cannot read very plain text - for example:
reads as gibberish.
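Concretely, the blur + threshold step was along these lines (a sketch; the kernel size is an assumption on my part):

import cv2

# Gaussian blur followed by Otsu thresholding to get black text on white
gray = cv2.imread('./images/fesa.jpg', 0)
blur = cv2.GaussianBlur(gray, (5, 5), 0)  # kernel size is a guess
# Otsu picks the threshold automatically; use THRESH_BINARY_INV instead
# if the text comes out white on black
_, binarized = cv2.threshold(blur, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)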
The processed text images are above, and this is the code I am using:
import cv2
import pytesseract
import matplotlib.pyplot as plt
from PIL import Image

img_path = './images/fesa.jpg'
img = Image.open(img_path)
boxes = detect(img, model, device)  # detect/model/device come from the EAST implementation
origbw = cv2.imread(img_path, 0)    # original image in grayscale

for box in boxes:
    # drop the trailing confidence score, keeping the 8 polygon coordinates
    box = box[:-1]
    poly = [(box[0], box[1]), (box[2], box[3]), (box[4], box[5]), (box[6], box[7])]
    x = []
    y = []
    for coord in poly:
        x.append(coord[0])
        y.append(coord[1])
    startX = int(min(x))
    startY = int(min(y))
    endX = int(max(x))
    endY = int(max(y))
    # use the pre-defined bounding boxes produced by EAST to crop the original image
    cropped_image = origbw[startY:endY, startX:endX]
    # contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=4.0, tileGridSize=(8, 8))
    res = clahe.apply(cropped_image)
    text = pytesseract.image_to_string(res, config="--psm 12")
    plt.imshow(res, cmap='gray')  # display the crop in grayscale
    plt.show()
    print(text)
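For completeness, the --oem/--psm sweep I mentioned was essentially a brute-force loop like this (a sketch over one image; the mode ranges are the documented values):

import cv2
import pytesseract

img = cv2.imread('./images/fesa.jpg', 0)
# brute-force sweep over OCR engine modes (0-3) and page segmentation modes (3-13)
for oem in range(4):
    for psm in range(3, 14):
        cfg = f'--oem {oem} --psm {psm}'
        try:
            print(cfg, repr(pytesseract.image_to_string(img, config=cfg)))
        except pytesseract.TesseractError:
            pass  # some engine modes need specific traineddata files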
Use these updated data files.
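For example, the Spanish file can be fetched from the tesseract-ocr/tessdata_best repository (a sketch; the exact URL and branch below are my assumption, so verify them before use):

import os
import urllib.request

# assumption: spa.traineddata from the tessdata_best repo; check the URL/branch
url = 'https://github.com/tesseract-ocr/tessdata_best/raw/main/spa.traineddata'
os.makedirs('./tessdata', exist_ok=True)
urllib.request.urlretrieve(url, './tessdata/spa.traineddata')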
This guide criticizes out-of-the-box performance (and maybe the accuracy could be affected too):
Trained data. At the moment of writing, the tesseract-ocr-eng APT package for Ubuntu 18.10 has terrible out-of-the-box performance, likely because of corrupt training data.
Based on the following test I did, using the updated data files seems to provide better results. This is the code I used:
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('farmacias.jpg'), lang='spa', config='--tessdata-dir ./tessdata --psm 7'))
I downloaded spa.traineddata (your example images have Spanish words, right?) to ./tessdata/spa.traineddata. And the result was:
ARMACIAS
And for the second image:
PECIALIZADA:
I used --psm 7 because here it says that it means "Treat the image as a single text line", which seemed to make sense for your test images.
In this Google Colab you can see the test I did.