python, ocr, tesseract

OCR for Bank Receipts


I am working on an OCR problem for bank receipts and need to extract details such as the date and the account number. After preprocessing the input, I run Tesseract OCR (through pytesseract in Python). I have obtained the hOCR output file, but I am not able to make sense of it. How do we extract information from the hOCR output file? Note that the receipt has numbers filled in boxes, as on typical forms.

I used the code below to read the file. Should I use a different encoding?

import os

# Read the hOCR file produced by Tesseract, if it exists.
if os.path.isfile('output.hocr'):
    with open('output.hocr', 'r', encoding='UTF-8') as fp:
        text = fp.read()

Note: The attached image is one example of the data. These images come from PDF files, which I convert into images programmatically.


Solution

  • Personally, I would use something like Tesseract for the OCR, and then perhaps something like OpenCV with SURF for the tick boxes...

    Alternatively, run edge detection with OpenCV and SURF on each section and OCR that specific area; analyzing one area at a time rather than the whole document should make the extraction more robust.
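On the hOCR file itself: it is HTML in which Tesseract normally wraps each recognized word in a span with class ocrx_word, and the title attribute of that span carries the bounding box as "bbox x0 y0 x1 y1". A rough sketch of reading those words with the standard library, and of keeping only the words that fall inside one area of the page, could look like the following. The names HocrWordParser, parse_hocr and words_in_region are made up for illustration, the sample markup and its values are invented, and the region coordinates are an assumption you would have to measure on your own receipts.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collect (text, bbox) for each ocrx_word span.

    Simplification: assumes word spans do not contain nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span currently open, if any
        self._chunks = []    # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr(hocr_text):
    """Return every recognized word with its bounding box."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def words_in_region(words, region):
    """Keep the words whose box lies entirely inside region (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    return [
        text
        for text, (x0, y0, x1, y1) in words
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1
    ]


# Invented sample markup, shaped like Tesseract's hOCR output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 40 50 400 80'>
  <span class='ocrx_word' title='bbox 40 50 100 80; x_wconf 93'>Date</span>
  <span class='ocrx_word' title='bbox 120 50 260 80; x_wconf 88'>12/03/2019</span>
 </span>
 <span class='ocr_line' title='bbox 40 120 500 150'>
  <span class='ocrx_word' title='bbox 40 120 140 150; x_wconf 91'>Account</span>
  <span class='ocrx_word' title='bbox 160 120 360 150; x_wconf 85'>0012345678</span>
 </span>
</div>
"""

words = parse_hocr(SAMPLE_HOCR)
# Assumed area of the date field on this made-up page.
date_words = words_in_region(words, (100, 40, 300, 90))
print(date_words)
```

With your own file you would pass the text you already read from output.hocr to parse_hocr. Filtering by bounding box is one way to approximate the "analyze a specific area" idea after the fact; it will not help if the printed boxes around the digits confuse the OCR in the first place, in which case cropping the region before running Tesseract is probably the better route.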