Tags: python, pdf, tesseract, python-tesseract, text-extraction

Extract Hindi text from a PDF file


I am working on a task to extract some information (in Hindi) from a PDF file and convert it into a data frame.

I have tried many approaches and followed several articles and Stack Overflow answers. I also tried different libraries such as EasyOCR, PaddleOCR, and others, but could not get the correct output.

Here is the link for the document. Link

Things I have tried:

  1. How to improve Hindi text extraction?
  2. How do I display the contours of an image using OpenCV Python?
  3. https://amannair723.medium.com/pdf-to-excel-using-advance-python-nlp-and-computer-vision-aka-document-ai-23cc0fb56549

It seems that I am unable to get the exact contours to create the bounding box. Below you can see the image of the output I am getting.

Output Image

All I need is to convert this information into a data frame whose columns would be नाम, पति का नाम / पिता का नाम, मकान संख्या, and so on.

Below is the code I am using to get the data:

import cv2
import pytesseract
import numpy as np
from pytesseract import Output
image = cv2.imread('pages_new/page3.jpg')
img = image.copy()
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# Filter for ROI using contour area and aspect ratio
contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]
for c in contours:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.05 * peri, True)
    x,y,w,h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if area > 10000 and aspect_ratio > .5:
        mask[y:y+h, x:x+w] = image[y:y+h, x:x+w]
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img) 
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img2 = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# cv2.imshow('img', img)
# cv2.imshow('img2', img2)
# Perform OCR with Pytesseract
data = pytesseract.image_to_string(mask, lang='Devanagari', config='--psm 6')
print(data)
# cv2.imshow('thresh', thresh)
# cv2.imshow('mask', mask)

Also, could anyone please confirm: if the information on the page changes, do we have to write different code for each document, or can we use one generic script for all of them?


Solution

  • I am able to get the data using the code below:

    import cv2
    import numpy as np
    import pdf2image
    import pytesseract
    
    # Extract page 3 from PDF in proper quality
    page_3 = np.array(pdf2image.convert_from_path('ROLL_Download.aspx.pdf',
                                                  first_page=3, last_page=3,
                                                  dpi=300, grayscale=True)[0])
    
    # Inverse binarize for contour finding
    thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]
    
    # Find contours (the return value differs between OpenCV versions)
    cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    
    # Keep only the large contours, i.e. the table boxes
    cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
    no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)
    
    # Sort the table boxes top to bottom, then left to right
    data = []
    rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: (r[1], r[0]))
    for i_r, (x, y, w, h) in enumerate(rects, start=1):
        
        cnts = cv2.findContours(page_3[y+1:y+h-1, x+1:x+w-1], cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        cnts = cnts[0] if len(cnts) == 2 else cnts[1]    
        
        inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: (r[1], r[0]))
        
        print('\nExtract texts inside table {}\n'.format(i_r))
        for (xx, yy, ww, hh) in inner_rects:
            # Set current coordinates w.r.t. full image
            xx += x
            yy += y
    
            # Get current cell
            cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]
                
            # Floodfill rectangles around numbers
            ys, xs = np.min(np.argwhere(cell == 0), axis=0)
            print("The value for xs:{} and ys:{}".format(xs, ys))
            temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
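            # Dilate the binarized cell with a cell-sized kernel so the text merges into one region
            mask = cv2.morphologyEx(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(), cv2.MORPH_DILATE, np.full((ww, hh), 255))
    
            # Find bounding rectangles of the merged text regions
            cnts = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)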
            cnts = cnts[0] if len(cnts) == 2 else cnts[1]
            boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda b: b[0])
    
            # Extract texts from each part of the current cell
            for x_b, y_b, w_b, h_b in boxes:
    #             print("The value for i_b is:",i_b)
    
                text = pytesseract.image_to_string(
                    temp[y_b:y_b+h_b, x_b:x_b+w_b],
                    config='--psm 6',
                    lang='Devanagari')
                #text = text.replace('\f', '')
                print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))
                data.append(text)
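
One fragile spot in the code above is the sort key `(r[1], r[0])`: boxes in the same visual row can have top edges that differ by a pixel or two, and an exact sort on `y` would then put them in the wrong left-to-right order. A possible refinement is sketched below. The helper name `sort_reading_order` and the tolerance `row_tol=10` are assumptions for illustration, not part of the code above, and the tolerance would need tuning for the scan resolution.

```python
def sort_reading_order(rects, row_tol=10):
    """Sort (x, y, w, h) boxes top to bottom, then left to right.

    A box joins the current row when its top edge is within row_tol
    pixels of the top edge of the first box in that row.
    """
    rows = []
    for rect in sorted(rects, key=lambda r: r[1]):
        if rows and abs(rect[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(rect)   # same row as the previous boxes
        else:
            rows.append([rect])     # start a new row
    # Inside each row, order the boxes by their left edge
    return [r for row in rows for r in sorted(row, key=lambda r: r[0])]
```

It could replace both `sorted(...)` calls that use the `(r[1], r[0])` key, since it takes the same list of `cv2.boundingRect` tuples.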
    

    Below is the output:

    Extract texts inside table 1
        
        The value for xs:13 and ys:0
        x: 103, y: 209, text:
        1 WEZ1761006
        नाम : भीमसेन
        पिता का नाम : बच्चू सिंह
        मकान संख्या: देव नगर Photo is
        आयु : 33 लिंग : पुरुष Available
        
        The value for xs:13 and ys:0
        x: 857, y: 209, text:
        2 WEZ1391713
        नाम : पूजा कुमारी
        पिता का नाम : विपिन सोनी
        मकान संख्याः वार्ड नं1 Photo is
        आयु : 23 लिग : स्त्री Available
        
        The value for xs:13 and ys:0
        x: 1610, y: 209, text:
        3 WEZ1781897
        नाम : सोनू
        पति का नाम : राजू
        मकान संख्याः वार्ड नं2 Photo is
        आयु : 3 लिग : स्त्री Available
        
        
        Extract texts inside table 2
        
        The value for xs:13 and ys:41
        x: 103, y: 507, text:
        #174 WEZ1735174
        नाम : रागिणी कुमारी कामत
        पिता का नाम : संतोष कामत
        मकान संख्याः 31 Photo is
        आयु : 19 लिग : स्त्री Available
        
        The value for xs:13 and ys:41
        x: 857, y: 507, text:
        5 WEZ1766005
        नाम : पर्तीक सिंग चिडे
        माता का नाम : कुलविंदर कौर
        मकान संख्याः देव नगर ,वार्ड नं. 2 Photo is
        आयु : 20 लिग : पुरुष Available
        
        The value for xs:13 and ys:41
        x: 1610, y: 507, text:
        [|] WEZ1755230
        नाम : रीता देवी
        पति का नाम : प्रेम यादव
        मकान संख्या: हाऊस नं. 05 Photo is
        आयु : ॐ लिग : स्त्री Available
        
        
        Extract texts inside table 3
        
        The value for xs:13 and ys:0
        x: 103, y: 807, text:
        7 WEZ1758721
        नाम : विश्व जीत वर्मा
        पिता का नाम : राम चद्र
        मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
        आयु : 25 लिंग : पुरुष Available
        
        The value for xs:13 and ys:0
        x: 857, y: 807, text:
        । | WEZ1758739
        नाम : हिम्मत वर्मा
        पिता का नाम : राम चद्र
        मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
        आयु : 23 लिंग : पुरुष Available
        
        The value for xs:13 and ys:0
        x: 1610, y: 807, text:
        [१ WEZ1427087
        नाम : सोनू यादव
        पिता का नाम : ददन यादव
        मकान संख्या: हाऊस नं. 228 Photo is
        आयु : 23 लिंग : पुरुष Available
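
Since the goal in the question is a data frame, the text printed above still has to be split into columns. Below is a minimal sketch, assuming every record follows the five-line layout seen in the output. The helper name `parse_record`, the English column names, and the regular expressions are illustrative assumptions inferred from the sample output only, not part of the original code.

```python
import re

# Patterns inferred from the sample output only (assumption).
# OCR emits either a colon or a visarga (U+0903) after a label.
SEP = r'\s*[:\u0903]\s*'
VOTER_ID_RE = re.compile(r'\b([A-Z]{3}[0-9]{7})\b')
NAME_RE = re.compile(r'^नाम' + SEP + r'(.+)$')
RELATION_RE = re.compile(r'^(पिता|पति|माता) का नाम' + SEP + r'(.+)$')
HOUSE_RE = re.compile(r'^मकान संख्या' + SEP + r'(.+)$')
# The optional anusvara (U+0902) also accepts the OCR variant 'लिग'.
AGE_GENDER_RE = re.compile(r'^आयु' + SEP + r'(\S+)\s+लि\u0902?ग' + SEP + r'(\S+)')
# Trailing fragments of the "Photo is Available" box
PHOTO_NOTE_RE = re.compile(r'\s*(?:Photo is|Available)\s*$')


def parse_record(text):
    """Turn the OCR text of one cell into a dict (hypothetical helper)."""
    record = {}
    for raw_line in text.splitlines():
        line = PHOTO_NOTE_RE.sub('', raw_line.strip())
        match = NAME_RE.match(line)
        if match:
            record['name'] = match.group(1).strip()
            continue
        match = RELATION_RE.match(line)
        if match:
            record['relation'] = match.group(1)
            record['relation_name'] = match.group(2).strip()
            continue
        match = HOUSE_RE.match(line)
        if match:
            record['house_no'] = match.group(1).strip()
            continue
        match = AGE_GENDER_RE.match(line)
        if match:
            record['age'] = match.group(1)      # kept as text; OCR may misread it
            record['gender'] = match.group(2)
            continue
        match = VOTER_ID_RE.search(line)
        if match:
            record['voter_id'] = match.group(1)
    return record
```

Applying `parse_record` to each entry of `data` gives a list of dicts, which can be passed directly to `pandas.DataFrame(...)`. OCR misreads such as `ॐ` in place of an age are passed through unchanged and would still need manual review.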