cv2.rectangle join closest bounding box

I'm trying to isolate the words of a medieval manuscript on a scanned page. I'm using cv2 to detect zones, and it gives me a fairly satisfying result. I labeled every rectangle with an incrementing number, and I'm worried about the fact that the detected zones are not contiguous. Here is a sample result of the cv2 bounding box zones on a word:

Here is the code I used:

import numpy as np
import cv2
import matplotlib.pyplot as plt
# This is font for labels
font = cv2.FONT_HERSHEY_SIMPLEX
# I load a picture of a page, gray and blur it
im = cv2.imread('test.png')
imgray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
image_blurred = cv2.GaussianBlur(imgray, (5, 5), 0)
image_blurred = cv2.dilate(image_blurred, None)
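# The 4th argument is the threshold type; with THRESH_OTSU the threshold value (0 here) is computed automatically
ret, thresh = cv2.threshold(image_blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)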
# I try to retrieve contours and hierarchy on the sample
# [-2:] keeps this working on both OpenCV 3 (three return values) and OpenCV 4 (two)
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]
hierarchy = hierarchy[0]
# I go through every contour and retrieve its bounding box
for i,component in enumerate(zip(contours, hierarchy)):
    cnt = component[0]
    currentHierarchy = component[1]
    precision = 0.01
    epsilon = precision*cv2.arcLength(cnt,True)
    approx = cv2.approxPolyDP(cnt,epsilon,True)
    # This is the best combination I found to isolate parent containers.
    # It gives me the best result (even if I'm not sure what I'm doing).
    # hierarchy[2] is the first child index, hierarchy[3] is the parent index (-1 if none)
    # I thought currentHierarchy[3] < 0 would be better,
    # but it gives no result
    if currentHierarchy[2] > 0 and currentHierarchy[3] > 0:
        x,y,w,h = cv2.boundingRect(approx)
        cv2.rectangle(im,(x,y),(x+w,y+h),(0,255,0),2)
        cv2.putText(im,str(i),(x+2,y+2), font, 1,(0,255,0),2,cv2.LINE_AA)

plt.imshow(im)
plt.show()

I would like to join the closest zones together in order to get a word tokenization of my page. In my sample picture, I would like to join 2835, 2847, 2864, 2878, 2870 and 2868.

How should I do this? I thought I could store the coordinates of every box in an array and then test (start_x, start_y) and (end_x, end_y), but that seems clumsy to me.

Could you please give me a hint?

Thanks,

Solution

I proceeded with my own approach to pick out the individual words. It is not perfectly accurate, but have a look at the image below:

Pseudocode:

1. Applied Gaussian blur to the grayscale image.
2. Performed Otsu's threshold.
3. Performed a couple of morphological operations:
   3.1 Erosion to try to remove that thin line in the top-left side of the image.
   3.2 Dilation to join single letters separated due to the previous operation.
4. Found contours above a certain area and marked them.

EDIT

Code:

import cv2

im = cv2.imread('corpus.png')
imgray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
image_blurred = cv2.GaussianBlur(imgray, (9, 9), 0)
cv2.imshow('blur', image_blurred)

image_blurred_d = cv2.dilate(image_blurred, None)
cv2.imshow('dilated_blur', image_blurred_d)

# With THRESH_OTSU the threshold is computed automatically, so the 127 passed here is ignored
ret, thresh = cv2.threshold(image_blurred_d, 127, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
cv2.imshow('thresh', thresh)

kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
erosion = cv2.erode(thresh, kernel, iterations = 1)
cv2.imshow('erosion', erosion)

kernel1 = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
dilation = cv2.dilate(erosion, kernel1, iterations = 1)
cv2.imshow('dilation', dilation)

# [-2:] keeps this working on both OpenCV 3 (three return values) and OpenCV 4 (two)
contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
count = 0
for cnt in contours:
    if (cv2.contourArea(cnt) > 100):
        x, y, w, h = cv2.boundingRect(cnt)
        cv2.rectangle(im, (x,y), (x+w,y+h), (0, 255, 0), 2)
        count+=1
print('Number of probable words', count)

cv2.imshow('final', im)
cv2.waitKey(0)
cv2.destroyAllWindows()
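
If you would rather join the boxes explicitly, as suggested in the question, instead of relying on dilation, something along these lines might work. This is only a rough sketch and not what the code above does: the helper names (boxes_are_close, union, merge_close_boxes) and the max_gap parameter are made up for illustration, boxes are assumed to be (x, y, w, h) tuples, and the gap value would need tuning for your scans.

```python
def boxes_are_close(a, b, max_gap):
    """True if boxes a and b, given as (x, y, w, h), overlap or are within max_gap pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Gap along each axis; zero or negative means the projections touch or overlap.
    gap_x = max(ax, bx) - min(ax + aw, bx + bw)
    gap_y = max(ay, by) - min(ay + ah, by + bh)
    return gap_x <= max_gap and gap_y <= max_gap


def union(a, b):
    """Smallest (x, y, w, h) box containing both a and b."""
    x1 = min(a[0], b[0])
    y1 = min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)


def merge_close_boxes(boxes, max_gap=10):
    """Repeatedly merge boxes that are close to each other until nothing changes."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        result = []
        while boxes:
            current = boxes.pop()
            i = 0
            while i < len(boxes):
                if boxes_are_close(current, boxes[i], max_gap):
                    current = union(current, boxes.pop(i))
                    merged = True
                    i = 0  # current grew, so re-check the remaining boxes
                else:
                    i += 1
            result.append(current)
        boxes = result
    return boxes


if __name__ == "__main__":
    sample = [(0, 0, 10, 10), (12, 0, 10, 10), (100, 100, 5, 5)]
    print(sorted(merge_close_boxes(sample, max_gap=5)))
```

Since cv2.boundingRect returns an (x, y, w, h) tuple, you could collect those tuples in a list inside the contour loop, pass the list to merge_close_boxes, and draw the merged rectangles afterwards. It compares every pair of boxes, so it may get slow on pages with several thousand contours.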