
How to make bounding boxes around text areas in an image? (Even if the text is skewed!)


I am trying to detect and grab text from screenshots of consumer product ads.

My code works with reasonable accuracy but fails to draw bounding boxes around skewed text areas.

Recently I tried the Google Vision API, and it draws bounding boxes around almost every possible text area and detects the text in those areas with great accuracy. I am curious how I can achieve the same or similar results!

My test image:


Google Vision API output with bounding boxes:


Thank you in advance:)


Solution

  • There are a few open-source vision packages that can detect text in images with noisy backgrounds, with results comparable to Google's Vision API.

    You can use EAST (Efficient and Accurate Scene Text Detector), a simple fully convolutional network architecture by Zhou et al.: https://arxiv.org/abs/1704.03155v2

    Using Python:

    Download the pre-trained model from https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1 and extract it to your current folder.

    You will need OpenCV >= 3.4.2 to run the code below.

    import cv2
    import math

    # Load the frozen EAST model we get after extraction
    net = cv2.dnn.readNet("frozen_east_text_detection.pb")
    frame = cv2.imread(<image_filename>)
    inpWidth = inpHeight = 320  # default EAST input size (must be a multiple of 32)
    # Prepare a blob to pass the image through the neural network,
    # subtracting the per-channel mean values used while training the model
    image_blob = cv2.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)
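
    To see what blobFromImage is doing here, the transform (BGR-to-RGB swap, per-channel mean subtraction, HWC-to-NCHW reordering) can be sketched in plain NumPy. This is an illustrative sketch only: `make_blob` is a hypothetical helper, not part of OpenCV, and it assumes the image has already been resized to the network input size.

```python
import numpy as np

def make_blob(img, mean, swap_rb=True):
    # img: H x W x 3 uint8 BGR image, assumed already resized to the input size
    x = img.astype(np.float32)
    if swap_rb:
        x = x[:, :, ::-1]                      # BGR -> RGB
    x -= np.array(mean, dtype=np.float32)      # subtract per-channel mean
    return x.transpose(2, 0, 1)[None, ...]     # HWC -> 1 x C x H x W blob
```

    The resulting array has the same shape as the blob OpenCV builds, i.e. (1, 3, 320, 320) for a 320x320 input.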
    

    Next, define the output layers, which produce the positional values of the detected text and the corresponding confidence scores (via the sigmoid activation):

    output_layer = ["feature_fusion/Conv_7/Sigmoid",  # confidence scores
                    "feature_fusion/concat_3"]        # box geometry
    

    Finally, do a forward pass through the network to get the desired outputs.

    net.setInput(image_blob)
    output = net.forward(output_layer)
    scores = output[0]    # confidence map
    geometry = output[1]  # box geometry: four edge distances plus a rotation angle
    

    Here I have used the decode function defined in OpenCV's GitHub sample, https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.py, to convert the positional values into box coordinates (lines 23 to 75).

    For the box detection threshold I have used a value of 0.5, and for non-maximum suppression 0.3. You can try different values to achieve better bounding boxes.

    confThreshold = 0.5
    nmsThreshold = 0.3
    [boxes, confidences] = decode(scores, geometry, confThreshold)
    indices = cv2.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
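
    One caveat: the shape of the indices returned by NMSBoxesRotated varies across OpenCV versions (an N x 1 array in some releases, hence the i[0] in the drawing loop below, a flat array in others). A small hypothetical helper can normalize either form:

```python
import numpy as np

def flatten_indices(indices):
    # NMS may return an empty tuple, a flat array, or an N x 1 array
    return [int(i) for i in np.asarray(indices).reshape(-1)]
```

    With `for i in flatten_indices(indices):` the loop body can then index `boxes[i]` directly, regardless of OpenCV version.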
    

    Lastly, to overlay the boxes over the detected text in image:

    height_ = frame.shape[0]
    width_ = frame.shape[1]
    rW = width_ / float(inpWidth)
    rH = height_ / float(inpHeight)

    for i in indices:
        # get the 4 corners of the rotated rect
        vertices = cv2.boxPoints(boxes[i[0]])
        # scale the bounding box coordinates back to the original image size
        for j in range(4):
            vertices[j][0] *= rW
            vertices[j][1] *= rH
        # draw the box edges (cv2.line requires integer coordinates)
        for j in range(4):
            p1 = (int(vertices[j][0]), int(vertices[j][1]))
            p2 = (int(vertices[(j + 1) % 4][0]), int(vertices[(j + 1) % 4][1]))
            cv2.line(frame, p1, p2, (0, 255, 0), 3)

    # Save the result
    cv2.imwrite("maggi_boxed.jpg", frame)
    

    Maggi's Ad with bounding boxes

    I have not experimented much with the threshold values; tuning them should give better results and also remove the misclassification of the logo as text.

    Note: the model was trained on an English corpus, so Hindi words will not be detected. You can also read the paper, which outlines the test datasets it was benchmarked on.