I am working on a task to extract some information (in Hindi) from a PDF file and convert it into a data frame.
I have tried many approaches, following several articles and Stack Overflow answers. I also tried different libraries such as EasyOCR, PaddleOCR, and others, but was unable to get the correct output.
Here is the link for the document. Link
Things I have tried:
It seems that I am unable to get the exact contours to create the bounding box. Below you can see the image of the output I am getting.
All I need is to convert this information into a data frame whose columns would be नाम, पति का नाम / पिता का नाम, मकान संख्या, and so on.
Below is the code I am using to get data:-
import cv2
import pytesseract
import numpy as np
from pytesseract import Output

image = cv2.imread('pages_new/page3.jpg')
img = image.copy()
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Filter for ROI using contour area and aspect ratio
contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]
for c in contours:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.05 * peri, True)
    x, y, w, h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if area > 10000 and aspect_ratio > .5:
        # Copy the region of interest into the (initially black) mask
        mask[y:y+h, x:x+w] = image[y:y+h, x:x+w]

# Draw character-level boxes (image_to_boxes uses a bottom-left origin)
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

# Draw word-level boxes for detections with confidence above 60
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img2 = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# cv2.imshow('img', img)
# cv2.imshow('img2', img2)

# Perform OCR with Pytesseract
data = pytesseract.image_to_string(mask, lang='Devanagari', config='--psm 6')
print(data)
# cv2.imshow('thresh', thresh)
# cv2.imshow('mask', mask)
Also, could anyone please confirm: if the information on the page changes, do we have to write different code for each document, or can we use one generic script for all the docs?
I am able to get the data using the below code:-
import cv2
import numpy as np
import pdf2image
import pytesseract

# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('ROLL_Download.aspx.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)

data = []
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: (r[1], r[0]))
for i_r, (x, y, w, h) in enumerate(rects, start=1):
    cnts = cv2.findContours(page_3[y+1:y+h-1, x+1:x+w-1], cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: (r[1], r[0]))
    print('\nExtract texts inside table {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:
        # Set current coordinates w.r.t. full image
        xx += x
        yy += y
        # Get current cell
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]
        # Floodfill rectangles around numbers
        ys, xs = np.min(np.argwhere(cell == 0), axis=0)
        print("The value for xs:{} and ys:{}".format(xs, ys))
        temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
        mask = cv2.morphologyEx(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(), cv2.MORPH_DILATE, np.full((ww, hh), 255))
        cnts = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
        cnts = cnts[0] if len(cnts) == 2 else cnts[1]
        boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda b: b[0])
        # Extract texts from each part of the current cell
        for x_b, y_b, w_b, h_b in boxes:
            # print("The value for i_b is:",i_b)
            text = pytesseract.image_to_string(
                temp[y_b:y_b+h_b, x_b:x_b+w_b],
                config='--psm 6',
                lang='Devanagari')
            #text = text.replace('\f', '')
            print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))
            data.append(text)
Below is the output:-
Extract texts inside table 1
The value for xs:13 and ys:0
x: 103, y: 209, text:
1 WEZ1761006
नाम : भीमसेन
पिता का नाम : बच्चू सिंह
मकान संख्या: देव नगर Photo is
आयु : 33 लिंग : पुरुष Available
The value for xs:13 and ys:0
x: 857, y: 209, text:
2 WEZ1391713
नाम : पूजा कुमारी
पिता का नाम : विपिन सोनी
मकान संख्याः वार्ड नं1 Photo is
आयु : 23 लिग : स्त्री Available
The value for xs:13 and ys:0
x: 1610, y: 209, text:
3 WEZ1781897
नाम : सोनू
पति का नाम : राजू
मकान संख्याः वार्ड नं2 Photo is
आयु : 3 लिग : स्त्री Available
Extract texts inside table 2
The value for xs:13 and ys:41
x: 103, y: 507, text:
#174 WEZ1735174
नाम : रागिणी कुमारी कामत
पिता का नाम : संतोष कामत
मकान संख्याः 31 Photo is
आयु : 19 लिग : स्त्री Available
The value for xs:13 and ys:41
x: 857, y: 507, text:
5 WEZ1766005
नाम : पर्तीक सिंग चिडे
माता का नाम : कुलविंदर कौर
मकान संख्याः देव नगर ,वार्ड नं. 2 Photo is
आयु : 20 लिग : पुरुष Available
The value for xs:13 and ys:41
x: 1610, y: 507, text:
[|] WEZ1755230
नाम : रीता देवी
पति का नाम : प्रेम यादव
मकान संख्या: हाऊस नं. 05 Photo is
आयु : ॐ लिग : स्त्री Available
Extract texts inside table 3
The value for xs:13 and ys:0
x: 103, y: 807, text:
7 WEZ1758721
नाम : विश्व जीत वर्मा
पिता का नाम : राम चद्र
मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
आयु : 25 लिंग : पुरुष Available
The value for xs:13 and ys:0
x: 857, y: 807, text:
। | WEZ1758739
नाम : हिम्मत वर्मा
पिता का नाम : राम चद्र
मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
आयु : 23 लिंग : पुरुष Available
The value for xs:13 and ys:0
x: 1610, y: 807, text:
[१ WEZ1427087
नाम : सोनू यादव
पिता का नाम : ददन यादव
मकान संख्या: हाऊस नं. 228 Photo is
आयु : 23 लिंग : पुरुष Available
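
The per-cell text above still has to be split into columns before it can go into a data frame. Below is a minimal sketch of that step, not a definitive parser. It assumes, based only on the sample output above, that IDs look like three capital letters followed by seven digits, that "Photo is" / "Available" are residue from the photo placeholder at the end of a line, and that Tesseract sometimes reads the colon as a visarga (ः) or drops the anusvara in लिंग. The function name `parse_cell` and the English key names are hypothetical and can be renamed to the Hindi labels.

```python
import re

# Assumption from the sample output: IDs are 3 capital letters + 7 digits.
ID_RE = re.compile(r'[A-Z]{3}\d{7}')
# Assumption: "Photo is" / "Available" at the end of a line is photo-box residue.
NOISE_RE = re.compile(r'\s*(?:Photo is|Available)\s*$')
# Age and gender share one line; tolerate "ः" for ":" and "लिग" for "लिंग".
AGE_GENDER_RE = re.compile(r'आयु\s*[:ः]\s*(\S*?)\s*लिं?ग\s*[:ः]\s*(\S+)')
# Label/value separator: a colon or a visarga misread as a colon.
SEP_RE = re.compile(r'\s*[:ः]\s*')


def parse_cell(text):
    """Split the OCR text of one cell into a dict (hypothetical helper)."""
    record = {'id': None, 'name': None, 'relation': None,
              'relation_name': None, 'house': None,
              'age': None, 'gender': None}
    for raw_line in text.splitlines():
        line = NOISE_RE.sub('', raw_line).strip()
        if not line:
            continue
        # The first line carries the serial number and the ID.
        id_match = ID_RE.search(line)
        if id_match and record['id'] is None:
            record['id'] = id_match.group()
            continue
        # "आयु : 33 लिंग : पुरुष" holds two fields on one line.
        age_gender = AGE_GENDER_RE.search(line)
        if age_gender:
            record['age'], record['gender'] = age_gender.groups()
            continue
        # Remaining lines are "label : value" pairs.
        parts = SEP_RE.split(line, maxsplit=1)
        if len(parts) != 2:
            continue
        label, value = parts
        if label == 'नाम':
            record['name'] = value
        elif label.endswith('का नाम'):
            # पिता / पति / माता is kept as the relation type.
            record['relation'] = label.split()[0]
            record['relation_name'] = value
        elif label.startswith('मकान'):
            record['house'] = value
    return record


# First cell from the output above, used as a small sample.
sample = ("1 WEZ1761006\n"
          "नाम : भीमसेन\n"
          "पिता का नाम : बच्चू सिंह\n"
          "मकान संख्या: देव नगर Photo is\n"
          "आयु : 33 लिंग : पुरुष Available\n")
records = [parse_cell(sample)]
```

With pandas installed, `pandas.DataFrame([parse_cell(t) for t in data])` should then give one row per cell. Values are kept as strings, so misreads such as `ॐ` for an age pass through unchanged and would still need manual or rule-based cleanup.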