
How to make bounding boxes around text areas in an image? (Even if the text is skewed!)


I am trying to detect and grab text from screenshots of consumer product ads.

My code works with reasonable accuracy but fails to draw bounding boxes around skewed text areas.

Recently I tried the Google Vision API, and it draws bounding boxes around almost every possible text area and detects the text in those areas with great accuracy. I am curious how I can achieve the same or similar results!

My test image:


Google Vision API output with bounding boxes:


Thank you in advance:)


Solution

  • There are a few open-source vision packages that can detect text in images with noisy backgrounds, with results comparable to Google's Vision API.

    You can use EAST (Efficient and Accurate Scene Text Detector), a simple fully convolutional network architecture by Zhou et al.: https://arxiv.org/abs/1704.03155v2

    Using Python:

    Download the pre-trained model from https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1 and extract it to your current folder.

    You will need OpenCV >= 3.4.2 to run the code below.

    import cv2
    import math

    # Load the frozen EAST model we get after extraction
    net = cv2.dnn.readNet("frozen_east_text_detection.pb")
    frame = cv2.imread(<image_filename>)
    inpWidth = inpHeight = 320  # default EAST input size (must be a multiple of 32)
    # Prepare a blob to pass the image through the neural network,
    # subtracting the per-channel mean values used while training the model
    image_blob = cv2.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)
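
    To see what blobFromImage is doing here, the transform (BGR-to-RGB swap, per-channel mean subtraction, HWC-to-NCHW reordering) can be sketched in plain NumPy. This is an illustrative sketch only: `make_blob` is a hypothetical helper, not part of OpenCV, and it assumes the image has already been resized to the network input size.

```python
import numpy as np

def make_blob(img, mean, swap_rb=True):
    # img: H x W x 3 uint8 BGR image, assumed already resized to the input size
    x = img.astype(np.float32)
    if swap_rb:
        x = x[:, :, ::-1]                      # BGR -> RGB
    x -= np.array(mean, dtype=np.float32)      # subtract per-channel mean
    return x.transpose(2, 0, 1)[None, ...]     # HWC -> 1 x C x H x W blob
```

    The resulting array has the same shape as the blob OpenCV builds, i.e. (1, 3, 320, 320) for a 320x320 input.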
    

    Next, define the output layers, which produce the positional values of the detected text and the corresponding confidence scores (via the sigmoid activation):

    output_layer = ["feature_fusion/Conv_7/Sigmoid",  # confidence scores
                    "feature_fusion/concat_3"]        # box geometry
    

    Finally, do a forward pass through the network to get the desired outputs.

    net.setInput(image_blob)
    output = net.forward(output_layer)
    scores = output[0]    # confidence map
    geometry = output[1]  # box geometry: four edge distances plus a rotation angle
    

    Here I have used the decode function defined in OpenCV's GitHub sample, https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.py, to convert the positional values into box coordinates (lines 23 to 75).

    For the box detection threshold I have used a value of 0.5, and for non-maximum suppression 0.3. You can try different values to achieve better bounding boxes.

    confThreshold = 0.5
    nmsThreshold = 0.3
    [boxes, confidences] = decode(scores, geometry, confThreshold)
    indices = cv2.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
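
    One caveat: the shape of the indices returned by NMSBoxesRotated varies across OpenCV versions (an N x 1 array in some releases, hence the i[0] in the drawing loop below, a flat array in others). A small hypothetical helper can normalize either form:

```python
import numpy as np

def flatten_indices(indices):
    # NMS may return an empty tuple, a flat array, or an N x 1 array
    return [int(i) for i in np.asarray(indices).reshape(-1)]
```

    With `for i in flatten_indices(indices):` the loop body can then index `boxes[i]` directly, regardless of OpenCV version.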
    

    Lastly, to overlay the boxes over the detected text in image:

    height_ = frame.shape[0]
    width_ = frame.shape[1]
    rW = width_ / float(inpWidth)
    rH = height_ / float(inpHeight)

    for i in indices:
        # get the 4 corners of the rotated rect
        vertices = cv2.boxPoints(boxes[i[0]])
        # scale the bounding box coordinates back to the original image size
        for j in range(4):
            vertices[j][0] *= rW
            vertices[j][1] *= rH
        # draw the box edges (cv2.line requires integer coordinates)
        for j in range(4):
            p1 = (int(vertices[j][0]), int(vertices[j][1]))
            p2 = (int(vertices[(j + 1) % 4][0]), int(vertices[(j + 1) % 4][1]))
            cv2.line(frame, p1, p2, (0, 255, 0), 3)

    # Save the result
    cv2.imwrite("maggi_boxed.jpg", frame)
    

    Maggi's Ad with bounding boxes

    I have not experimented much with the threshold values; tuning them should give better results and also remove the misclassification of the logo as text.

    Note: the model was trained on an English corpus, so Hindi words will not be detected. You can also read the paper, which outlines the test datasets it was benchmarked on.