python mongodb ocr text-parsing feature-extraction

Parsing date from OCR response in Python


I am trying to read a date from the OCR response for an image. The OCR output looks something like this:

\nPatientsName:KantibhaiPatelAgeISex:71YearslMale\nRef.by:Dr.KetanShuklaMS.MCH.\nReg.Date:29/06/201519;03\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\nHEMOGRAMREPORT\nTESTRESULTREFERENCEINTERVAL\n

I am interested in extracting the reporting date, i.e. 29/06/2015. I also want to store the patient details in a database (MongoDB) chronologically, so I need the date in a standardized format for easy future queries. All suggestions are welcome.

Edit: Since the data comes from OCR, it tends to contain a lot of noise and sometimes misread characters. Is there a method with better fault tolerance for string searching?

re.search(r'Date:([0-9]{2}\/[0-9]{2}\/[0-9]{4})', ocr_response).group(1)

The above statement explicitly looks for digits, but what if a digit is not read, or is misinterpreted as a letter?


Solution

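  • Use the re module:

    import re

    # [Date:]* is a character class, not the literal "Date:", so the first
    # date-like token wins (here the Reg.Date value, which is the same date).
    print(re.search(r'[Date:]*([0-9]{0,2}[\/-]([0-9]{0,2}|[a-z]{3})[\/-][0-9]{0,4})', ocr_response).group(1))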
    

    Output:

    29/06/2015
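
If you specifically want the reporting date, want to tolerate a few misread digits, and want a value MongoDB can sort on, something along these lines may help. This is only a sketch under assumptions: the function name parse_reporting_date, the substitution table OCR_DIGIT_FIXES (O/o to 0, l/I to 1, S to 5, B to 8), and the accepted separators are my own guesses about typical OCR noise, not something taken from the question. It anchors on the "ReportingDate" label, allows one stray punctuation character after it (the sample has "." instead of ":"), repairs letter/digit mix-ups, and parses the result into a datetime.

```python
import re
from datetime import datetime

# Assumed, partial table of letters that OCR commonly confuses with digits.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Label, at most one noisy separator, then day/month/year where each
# position may be a digit or one of the look-alike letters above.
REPORTING_DATE = re.compile(
    r"Reporting\s*Date\W?"
    r"([0-9OolISB]{2})[/\-.]([0-9OolISB]{2})[/\-.]([0-9OolISB]{4})"
)


def parse_reporting_date(ocr_text):
    """Return the reporting date as a datetime, or None if none is usable."""
    match = REPORTING_DATE.search(ocr_text)
    if match is None:
        return None
    # Map look-alike letters back to digits before parsing.
    day, month, year = (part.translate(OCR_DIGIT_FIXES) for part in match.groups())
    try:
        return datetime.strptime(f"{day}/{month}/{year}", "%d/%m/%Y")
    except ValueError:
        # e.g. month 13 after repair: treat as unreadable rather than guess.
        return None


sample = (
    "\nReg.Date:29/06/201519;03"
    "\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\n"
)
print(parse_reporting_date(sample).isoformat())  # 2015-06-29T00:00:00
```

A datetime value is stored by PyMongo as a BSON date, so a document such as {"reporting_date": parse_reporting_date(text)} (field name hypothetical) can be sorted and range-queried chronologically. Returning None for unreadable dates lets you flag those reports for manual review instead of storing a wrong date.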