How can I sort the words extracted from an image in their order of occurrence using contour detection?

I am building an OCR tool using contour detection. I have extracted the words and drawn bounding boxes around them, but when I crop the individual words, they do not come out in reading order. I have tried the sorting methods mentioned in this link, but they work best on objects, and in my case I need the order to be exact. Sorting sometimes changes the order of the words, because words on the same line have bounding boxes of different sizes, so their 'x' and 'y' values vary. Within a single line, words with large bounding boxes end up treated as one category and words with small ones as another, and each category is sorted separately. This is the code I use to sort:

    sorted_ctrs = sorted(ctrs, key=lambda ctr: cv2.boundingRect(ctr)[0]
                         + cv2.boundingRect(ctr)[1] * im.shape[1])

Image of the extracted bounding boxes
This is what I get after cropping from the sorted contours

Is there another method that can arrange my words so that their order makes sense?

Solution

You should start by separating out the different lines. Once you have done that, you can simply process the contours in each line from left to right (sorted from x = 0 to x = width).

Start by drawing the found contours on a black background. Next, sum the rows. The sum of rows without words/contours will be 0. There is usually some space between lines of text, which will have sum = 0. You can use this to find the min and max height values for each line of text.
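
The gap-finding step can also be expressed without an explicit loop. The following is only a minimal sketch under assumptions, not the code used in this answer: `find_line_bands` is a hypothetical helper name, and the small toy array stands in for the black image with the contours drawn on it.

```python
import numpy as np

def find_line_bands(row_sums):
    """Return (start, end) row pairs for each run of non-zero row sums.

    start is the first row of a run and end is one past its last row,
    so img[start:end] selects the band.
    """
    has_ink = np.asarray(row_sums) > 0
    # Pad with False on both sides so runs touching the top or bottom
    # edge of the image still get a start and an end.
    padded = np.concatenate(([False], has_ink, [False]))
    changes = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(changes == 1)   # 0 -> non-zero transitions
    ends = np.flatnonzero(changes == -1)    # non-zero -> 0 transitions
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Toy 8x6 "image" (hypothetical data): text at rows 1-2 and rows 5-6.
toy = np.zeros((8, 6), dtype=np.uint8)
toy[1:3, 1:5] = 255
toy[5:7, 0:3] = 255
bands = find_line_bands(toy.sum(axis=1))
print(bands)  # [(1, 3), (5, 7)]
```

On a real image you would pass `np.sum(img, axis=1)` as `row_sums`. This only works when there is at least one completely empty row between consecutive lines of text.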

To find the order of the words, first look for the contours in the y range of the first line, then for the lowest x.
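
To see why grouping by line first matters, here is a small sketch on made-up bounding boxes. The box values, `image_width`, and the helper name `sort_reading_order` are all assumptions chosen for illustration. It compares the `x + y * width` key from the question with a sort on (line number, x).

```python
def sort_reading_order(boxes, lines):
    """Order (x, y, w, h) boxes by line first, then left to right.

    lines is a list of (start_row, end_row) bands. A box belongs to the
    band that contains its top edge y; boxes outside every band are skipped.
    """
    keyed = []
    for idx, (x, y, w, h) in enumerate(boxes):
        for line_no, (top, bottom) in enumerate(lines):
            if top <= y <= bottom:
                keyed.append((line_no, x, idx))
                break
    return [boxes[idx] for _, _, idx in sorted(keyed)]

# Hypothetical boxes: on line 1 a tall word sits to the right of a short one.
image_width = 200
boxes = [
    (90, 10, 60, 30),   # tall word, line 1, second from the left
    (10, 18, 50, 15),   # short word, line 1, leftmost but starts lower
    (10, 60, 70, 30),   # line 2
]
lines = [(10, 40), (60, 90)]

# Key from the question: a few pixels of difference in y outweigh any x.
naive = sorted(boxes, key=lambda b: b[0] + b[1] * image_width)
by_line = sort_reading_order(boxes, lines)
print(naive)    # tall word comes first although it is further right
print(by_line)  # short word first, then tall word, then line 2
```

With the naive key, the tall word wins because its top edge is 8 pixels higher, which is multiplied by the image width. Once both words are assigned to the same line, only x decides their order.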

Input:

Code:

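import cv2
import numpy as np

# load image as grayscale (contours drawn on a black background)
# and get its dimensions
img = cv2.imread('xmple2.png', 0)
h, w = img.shape[:2]
# sum all rows; rows without contours sum to 0
sumOfRows = np.sum(img, axis=1)

# loop the summed values and store (start, end) rows of each text line
startindex = 0
lines = []
# True: looking for the start of a line, False: looking for its end
compVal = True
for i, val in enumerate(sumOfRows):
    # logical test to detect change between 0 and > 0
    testVal = (val > 0)
    if testVal == compVal:
        if val == 0:
            # the value changed to 0, so the previous rows
            # contained contours: add start/end index to list
            lines.append((startindex, i))
        else:
            # the value changed to > 0: a new line starts here
            startindex = i
        # invert logical test
        compVal = not compVal

# a line that runs to the bottom edge has no closing zero row
if not compVal:
    lines.append((startindex, h))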

You can use the `lines` list to further process the contours. The following code produces a list of the contours ordered by position, which you can check by the list index written on the image:

# create empty list
lineContours = []
# find contours (you already have this)
# [-2:] keeps (contours, hierarchy) on both OpenCV 3.x and 4.x
contours, hier = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
# loop contours, find the boundingrect,
# compare to line-values
# store line number, x value and contour index in list
for j, cnt in enumerate(contours):
    # cw/ch instead of w/h, so the image dimensions are not overwritten
    (x, y, cw, ch) = cv2.boundingRect(cnt)
    for i, line in enumerate(lines):
        if y >= line[0] and y <= line[1]:
            lineContours.append([line[0], x, j])
            break

# sort list on line number, x value and contour index
contours_sorted = sorted(lineContours)

# write list index on image
for i, cnt in enumerate(contours_sorted):
    line, xpos, cnt_index = cnt
    cv2.putText(img,str(i),(xpos,line+50),cv2.FONT_HERSHEY_SIMPLEX,1,(127),2,cv2.LINE_AA)

# show image
cv2.imshow('Img',img)
cv2.waitKey(0)
cv2.destroyAllWindows()

You can instead print the contour index:

# write contour index on image
for line, xpos, cnt_index in (contours_sorted):
    cv2.putText(img,str(cnt_index),(xpos,line+50),cv2.FONT_HERSHEY_SIMPLEX,1,(127),2,cv2.LINE_AA)

You could also create images for the separate lines:

# for each line found, create and display a subimage
for y1, y2 in lines:
    # take the full width of the image for rows y1..y2
    line = img[y1:y2, :]
    cv2.imshow('Img',line)
    cv2.waitKey(0)

cv2.destroyAllWindows()
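
To get back to the original goal of cropping the words in order, the sorted list can be turned into crops with plain array slicing. This is a minimal sketch on a toy array: `crop_words` is a hypothetical helper name, and the boxes are made-up values already in reading order.

```python
import numpy as np

def crop_words(img, ordered_boxes):
    """Return one sub-image per (x, y, w, h) box, in the order given."""
    return [img[y:y + h, x:x + w] for x, y, w, h in ordered_boxes]

# Toy image in which every pixel stores its column index,
# so it is easy to check where each crop came from.
img = np.tile(np.arange(20, dtype=np.uint8), (10, 1))
ordered_boxes = [(2, 1, 5, 3), (12, 0, 4, 6)]
crops = crop_words(img, ordered_boxes)
print([c.shape for c in crops])  # [(3, 5), (6, 4)]
```

With the code above, the ordered boxes could be obtained as `[cv2.boundingRect(contours[j]) for _, _, j in contours_sorted]`. Crop from a copy of the original image rather than from one that has had index labels drawn on it.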