I am trying to read a date from the OCR response for an image. The OCR output looks something like this:
\nPatientsName:KantibhaiPatelAgeISex:71YearslMale\nRef.by:Dr.KetanShuklaMS.MCH.\nReg.Date:29/06/201519;03\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\nHEMOGRAMREPORT\nTESTRESULTREFERENCEINTERVAL\n
I am interested in extracting the reporting date, i.e. 29/06/2015. I also want to store the patient details in a database (MongoDB) chronologically, so I need to store the date in a standardized format for easy future queries. All suggestions are welcome.
Edit: Since the data comes from an OCR response, there tends to be a lot of noise and some misread characters. Is there a method with better fault tolerance for string searching?
re.search(r'Date:([0-9]{2}\/[0-9]{2}\/[0-9]{4})', ocr_response).group(1)
The statement above explicitly looks for digits, but what if a digit is not read, or is misinterpreted as a letter?
Use the re module:
import re
print(re.search(r'(?:Date:)?([0-9]{0,2}[\/-]([0-9]{0,2}|[a-z]{3})[\/-][0-9]{0,4})', ocr_response).group(1))
Output:
29/06/2015
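For the misread-character and storage parts of the question, here is a minimal sketch rather than a definitive solution. It assumes the label still reads "ReportingDate", that the separator after it may be any single non-word character (the sample has "." where ":" was expected), and that the date is dd/mm/yyyy. The letter-to-digit table OCR_DIGIT_FIXES and the helper name extract_reporting_date are my own illustrative choices, not part of any library; adjust the table to the confusions your OCR engine actually produces.

```python
import re
from datetime import datetime

# Assumed OCR confusions (a letter read in place of a digit); extend as needed.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

# One "digit slot": a real digit or one of the confusable letters above.
SLOT = r"[0-9OoIlSB]"

# Label, at most one stray separator character, then dd<sep>mm<sep>yyyy.
REPORT_DATE = re.compile(
    r"ReportingDate\W?("
    + SLOT + r"{2}[/\-.]" + SLOT + r"{2}[/\-.]" + SLOT + r"{4})"
)


def extract_reporting_date(text):
    """Return the reporting date as a datetime, or None if missing or invalid."""
    match = REPORT_DATE.search(text)
    if match is None:
        return None
    # Map confusable letters back to digits, then unify the separators.
    cleaned = match.group(1).translate(OCR_DIGIT_FIXES)
    cleaned = re.sub(r"[-.]", "/", cleaned)
    try:
        # strptime rejects impossible dates such as 99/99/2015.
        return datetime.strptime(cleaned, "%d/%m/%Y")
    except ValueError:
        return None


ocr_response = "\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\nHEMOGRAMREPORT\n"
report_date = extract_reporting_date(ocr_response)
print(report_date.date().isoformat())  # 2015-06-29
```

Keeping the result as a datetime should cover the chronological storage requirement, since PyMongo stores datetime.datetime values as BSON dates that sort and range-query correctly. If you prefer strings, the ISO form (yyyy-mm-dd) also sorts chronologically as plain text. Note that this only tolerates substitutions inside the date; if the label itself is misread, you would need a looser label pattern or a fuzzy-matching approach.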