I have used the following PyTorch implementation of EAST (Efficient and Accurate Scene Text Detector) to identify and draw bounding boxes around text in a number of images, and it works very well!
However, the next step of OCR, which I am attempting with pytesseract in order to extract the text from these images and convert it to strings, is failing horribly. Using all possible configurations of --oem and --psm, I am unable to get pytesseract to detect what appears to be very clear text, for example:
The recognized text is below the images. Even though I have applied contrast enhancement, and also tried dilating and eroding, I cannot get tesseract to recognize the text. This is just one example of many images where the text is even larger and clearer. Any suggestions on transformations, configs, or other libraries would be helpful!
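For reference, my dilation/erosion attempt looked roughly like this (a minimal sketch, not a working solution; the kernel size and iteration counts are my guesses):

import cv2
import numpy as np

# sketch of the dilate/erode preprocessing I tried; kernel size is a guess
gray = cv2.imread('./images/fesa.jpg', 0)
kernel = np.ones((2, 2), np.uint8)
dilated = cv2.dilate(gray, kernel, iterations=1)
eroded = cv2.erode(gray, kernel, iterations=1)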
UPDATE: After trying Gaussian blur + Otsu thresholding, I am able to get black text on a white background (which is apparently ideal for pytesseract), and I also added the Spanish language, but it still cannot read very plain text - for example:
reads as gibberish.
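Concretely, the blur + threshold step was along these lines (a sketch; the kernel size is an assumption on my part):

import cv2

# Gaussian blur followed by Otsu thresholding to get black text on white
gray = cv2.imread('./images/fesa.jpg', 0)
blur = cv2.GaussianBlur(gray, (5, 5), 0)  # kernel size is a guess
# Otsu picks the threshold automatically; use THRESH_BINARY_INV instead
# if the text comes out white on black
_, binarized = cv2.threshold(blur, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)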
The processed text images are above, and this is the code I am using:
import cv2
import pytesseract
import matplotlib.pyplot as plt
from PIL import Image

img_path = './images/fesa.jpg'
img = Image.open(img_path)
boxes = detect(img, model, device)  # detect/model/device come from the EAST implementation
origbw = cv2.imread(img_path, 0)    # original image in grayscale

for box in boxes:
    # drop the trailing confidence score, keeping the 8 polygon coordinates
    box = box[:-1]
    poly = [(box[0], box[1]), (box[2], box[3]), (box[4], box[5]), (box[6], box[7])]
    x = []
    y = []
    for coord in poly:
        x.append(coord[0])
        y.append(coord[1])
    startX = int(min(x))
    startY = int(min(y))
    endX = int(max(x))
    endY = int(max(y))
    # use the pre-defined bounding boxes produced by EAST to crop the original image
    cropped_image = origbw[startY:endY, startX:endX]
    # contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=4.0, tileGridSize=(8, 8))
    res = clahe.apply(cropped_image)
    text = pytesseract.image_to_string(res, config="--psm 12")
    plt.imshow(res, cmap='gray')  # display the crop in grayscale
    plt.show()
    print(text)
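For completeness, the --oem/--psm sweep I mentioned was essentially a brute-force loop like this (a sketch over one image; the mode ranges are the documented values):

import cv2
import pytesseract

img = cv2.imread('./images/fesa.jpg', 0)
# brute-force sweep over OCR engine modes (0-3) and page segmentation modes (3-13)
for oem in range(4):
    for psm in range(3, 14):
        cfg = f'--oem {oem} --psm {psm}'
        try:
            print(cfg, repr(pytesseract.image_to_string(img, config=cfg)))
        except pytesseract.TesseractError:
            pass  # some engine modes need specific traineddata files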
Use these updated data files.
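For example, the Spanish file can be fetched from the tesseract-ocr/tessdata_best repository (a sketch; the exact URL and branch below are my assumption, so verify them before use):

import os
import urllib.request

# assumption: spa.traineddata from the tessdata_best repo; check the URL/branch
url = 'https://github.com/tesseract-ocr/tessdata_best/raw/main/spa.traineddata'
os.makedirs('./tessdata', exist_ok=True)
urllib.request.urlretrieve(url, './tessdata/spa.traineddata')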
This guide criticizes out-of-the-box performance (and maybe the accuracy could be affected too):
Trained data. At the moment of writing, the tesseract-ocr-eng APT package for Ubuntu 18.10 has terrible out-of-the-box performance, likely because of corrupt training data.
Based on the following test I did, using the updated data files seems to provide better results. This is the code I used:
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('farmacias.jpg'), lang='spa', config='--tessdata-dir ./tessdata --psm 7'))
I downloaded spa.traineddata (your example images have Spanish words, right?) to ./tessdata/spa.traineddata. And the result was:
ARMACIAS
And for the second image:
PECIALIZADA:
I used --psm 7 because here it says that it means "Treat the image as a single text line", which seemed to make sense for your test images.
In this Google Colab you can see the test I did.