python mongodb ocr text-parsing feature-extraction

Parsing date from OCR response in Python

I am trying to read date from an OCR response of an image. The OCR output is something like this.

\nPatientsName:KantibhaiPatelAgeISex:71YearslMale\nRef.by:Dr.KetanShuklaMS.MCH.\nReg.Date:29/06/201519;03\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\nHEMOGRAMREPORT\nTESTRESULTREFERENCEINTERVAL\n

I am interested in extracting the reporting date i.e. 29/06/2015. Also I am interested in storing the patient details in a database (MongoDB) chronologically. Hence I need to store the date in a standardized format for easy future queries. All suggestions are welcomed.

Edit - Since the data is coming as an OCR response there tends to be a lot of noise and sometimes misinterpreted characters. Is there any method that can have a better fault tolerance for string searching.

re.search(r'Date:([0-9]{2}\/[0-9]{2}\/[0-9]{4})', ocr_response).group(1)

The above statement explicitly looks for numbers, but what if some number is not read or misinterpeted as a character ?

Solution

use re module:

import re

print re.search(r'[Date:]*([0-9]{0,2}[\/-]([0-9]{0,2}|[a-z]{3})[\/-][0-9]{0,4})', ocr_response).group(1)

Output:

29/06/2015