python pdf tesseract python-tesseract text-extraction

Extract Hindi text from a PDF file

I am working on a task to extract some information (in Hindi) from a PDF file and convert it into a data frame.

I have tried many approaches and followed many articles and Stack Overflow answers. I tried different libraries such as EasyOCR, PaddleOCR, and others, but was unable to get the correct output.

Here is the link for the document. Link

Things I have tried:

It seems that I am unable to get the exact contours to create the bounding box. Below you can see the image of the output I am getting.

All I need is to convert this information into a data frame whose columns would be: नाम, पति का नाम / पिता का नाम, मकान संख्या, and so on.

Below is the code I am using to get the data:

import cv2
import pytesseract
import numpy as np
from pytesseract import Output
image = cv2.imread('pages_new/page3.jpg')
img = image.copy()
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# Find contours (the return signature differs between OpenCV versions)
contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]
# Filter for ROI using contour area and aspect ratio
for c in contours:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.05 * peri, True)
    x,y,w,h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if area > 10000 and aspect_ratio > .5:
        mask[y:y+h, x:x+w] = image[y:y+h, x:x+w]
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img) 
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img2 = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# cv2.imshow('img', img)
# cv2.imshow('img2', img2)
# Perform OCR with Pytesseract
data = pytesseract.image_to_string(mask, lang='Devanagari', config='--psm 6')
print(data)
# cv2.imshow('thresh', thresh)
# cv2.imshow('mask', mask)

Also, could anyone please confirm: if the information on the page changes, do we have to write different code for each document, or can we write one generic script for all the documents?

Solution

I was able to get the data using the code below:

import cv2
import numpy as np
import pdf2image
import pytesseract

# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('ROLL_Download.aspx.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]

cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)
data = []
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: (r[1], r[0]))
for i_r, (x, y, w, h) in enumerate(rects, start=1):
    
    cnts = cv2.findContours(page_3[y+1:y+h-1, x+1:x+w-1], cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]    
    
    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: (r[1], r[0]))
    
    print('\nExtract texts inside table {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:
        # Set current coordinates w.r.t. full image
        xx += x
        yy += y

        # Get current cell
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]
            
        # Floodfill rectangles around numbers
        ys, xs = np.min(np.argwhere(cell == 0), axis=0)
        print("The value for xs:{} and ys:{}".format(xs, ys))
        temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
        mask = cv2.morphologyEx(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(), cv2.MORPH_DILATE, np.full((ww, hh),255))

        # Find the bounding boxes of the text regions in the dilated mask
        cnts = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
        cnts = cnts[0] if len(cnts) == 2 else cnts[1]
        boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda b: b[0])

        # Extract texts from each part of the current cell
        for x_b, y_b, w_b, h_b in boxes:

            text = pytesseract.image_to_string(
                temp[y_b:y_b+h_b, x_b:x_b+w_b],
                config='--psm 6',
                lang='Devanagari')
            #text = text.replace('\f', '')
            print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))
            data.append(text)
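
The loops above rely on `sorted(..., key=lambda r: (r[1], r[0]))` to put the boxes into reading order. That works on this page because the cells in one row share exactly the same `y`. On a page where the tops differ by a pixel or two, a strict (y, x) sort can shuffle the columns. Below is a minimal sketch of a more tolerant ordering, assuming that boxes whose tops lie within `row_tol` pixels of each other belong to the same row. Both the helper name `sort_reading_order` and the default of 10 pixels are illustrative guesses, not part of the code above.

```python
def sort_reading_order(rects, row_tol=10):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right within a row.

    row_tol is an assumed tolerance in pixels; tune it for the scan resolution.
    """
    rows = []
    for rect in sorted(rects, key=lambda r: r[1]):
        # Stay in the current row while the top edge is within row_tol
        # of the top edge of the first box in that row
        if rows and abs(rect[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(rect)
        else:
            rows.append([rect])
    # Order each row by x and flatten the rows into one list
    return [r for row in rows for r in sorted(row, key=lambda r: r[0])]


# Three cells of one row with slightly different tops, plus one cell of the next row
rects = [(857, 210, 700, 290), (103, 209, 700, 290),
         (1610, 208, 700, 290), (103, 507, 700, 290)]
ordered = sort_reading_order(rects)
```

With the sample boxes, a plain (y, x) sort would put the box at x=1610 first because its top is one pixel higher. The grouped version keeps it last in its row.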

Below is the output of the extraction code:

    Extract texts inside table 1
    
    The value for xs:13 and ys:0
    x: 103, y: 209, text:
    1 WEZ1761006
    नाम : भीमसेन
    पिता का नाम : बच्चू सिंह
    मकान संख्या: देव नगर Photo is
    आयु : 33 लिंग : पुरुष Available
    
    The value for xs:13 and ys:0
    x: 857, y: 209, text:
    2 WEZ1391713
    नाम : पूजा कुमारी
    पिता का नाम : विपिन सोनी
    मकान संख्याः वार्ड नं1 Photo is
    आयु : 23 लिग : स्त्री Available
    
    The value for xs:13 and ys:0
    x: 1610, y: 209, text:
    3 WEZ1781897
    नाम : सोनू
    पति का नाम : राजू
    मकान संख्याः वार्ड नं2 Photo is
    आयु : 3 लिग : स्त्री Available
    
    
    Extract texts inside table 2
    
    The value for xs:13 and ys:41
    x: 103, y: 507, text:
    #174 WEZ1735174
    नाम : रागिणी कुमारी कामत
    पिता का नाम : संतोष कामत
    मकान संख्याः 31 Photo is
    आयु : 19 लिग : स्त्री Available
    
    The value for xs:13 and ys:41
    x: 857, y: 507, text:
    5 WEZ1766005
    नाम : पर्तीक सिंग चिडे
    माता का नाम : कुलविंदर कौर
    मकान संख्याः देव नगर ,वार्ड नं. 2 Photo is
    आयु : 20 लिग : पुरुष Available
    
    The value for xs:13 and ys:41
    x: 1610, y: 507, text:
    [|] WEZ1755230
    नाम : रीता देवी
    पति का नाम : प्रेम यादव
    मकान संख्या: हाऊस नं. 05 Photo is
    आयु : ॐ लिग : स्त्री Available
    
    
    Extract texts inside table 3
    
    The value for xs:13 and ys:0
    x: 103, y: 807, text:
    7 WEZ1758721
    नाम : विश्व जीत वर्मा
    पिता का नाम : राम चद्र
    मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
    आयु : 25 लिंग : पुरुष Available
    
    The value for xs:13 and ys:0
    x: 857, y: 807, text:
    । | WEZ1758739
    नाम : हिम्मत वर्मा
    पिता का नाम : राम चद्र
    मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
    आयु : 23 लिंग : पुरुष Available
    
    The value for xs:13 and ys:0
    x: 1610, y: 807, text:
    [१ WEZ1427087
    नाम : सोनू यादव
    पिता का नाम : ददन यादव
    मकान संख्या: हाऊस नं. 228 Photo is
    आयु : 23 लिंग : पुरुष Available
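
The question asks for a data frame, while the code above stops at a list of raw OCR strings in `data`. One way to bridge that gap is sketched below, under the assumption that every card follows the label layout seen in the output. The function name `parse_card`, the field names, and the regular expressions are illustrative and not part of the original code. The label spellings are copied from the OCR output, including the `ः` that Tesseract sometimes returns in place of `:` and the misread `लिग`.

```python
import re

# Illustrative patterns; label spellings are copied from the OCR output above.
SEP = r"\s*[:ः]\s*"  # OCR returns either ':' or the visarga 'ः' after a label
NOISE = re.compile(r"\s*(?:Photo is|Available)\s*$")  # text from the photo box
ID_RE = re.compile(r"[A-Z]{3}\d{7}")  # e.g. WEZ1761006
NAME_RE = re.compile(r"^नाम" + SEP + r"(.+)$")
RELATION_RE = re.compile(r"^(पिता|पति|माता) का नाम" + SEP + r"(.+)$")
HOUSE_RE = re.compile(r"^मकान संख्या" + SEP + r"(.+)$")
# 'लि\S*?' accepts both 'लिंग' and the misread 'लिग'
AGE_GENDER_RE = re.compile(r"^आयु" + SEP + r"(\S+)\s+लि\S*?" + SEP + r"(\S+)")


def parse_card(text):
    """Turn the OCR text of one card into a dict; missing fields stay None."""
    record = dict.fromkeys(
        ["id_code", "name", "relation", "relative_name", "house", "age", "gender"])
    # splitlines() also splits on the form feed that Tesseract appends
    for raw in text.splitlines():
        line = NOISE.sub("", raw).strip()
        if not line:
            continue
        m = ID_RE.search(line)
        if m and record["id_code"] is None:
            record["id_code"] = m.group(0)
            continue
        m = NAME_RE.match(line)
        if m:
            record["name"] = m.group(1).strip()
            continue
        m = RELATION_RE.match(line)
        if m:
            record["relation"] = m.group(1)
            record["relative_name"] = m.group(2).strip()
            continue
        m = HOUSE_RE.match(line)
        if m:
            record["house"] = m.group(1).strip()
            continue
        m = AGE_GENDER_RE.match(line)
        if m:
            age = m.group(1)
            # Keep the age only if OCR produced digits (it returned 'ॐ' once)
            record["age"] = int(age) if age.isdecimal() else None
            record["gender"] = m.group(2)
    return record


sample = (
    "1 WEZ1761006\n"
    "नाम : भीमसेन\n"
    "पिता का नाम : बच्चू सिंह\n"
    "मकान संख्या: देव नगर Photo is\n"
    "आयु : 33 लिंग : पुरुष Available\n"
)
records = [parse_card(sample)]
```

Mapping `parse_card` over the `data` list and passing the resulting list of dicts to `pandas.DataFrame` should give one row per card. Values that the OCR misread are not corrected, only an unreadable age is set to `None`, so the frame still needs a sanity check.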