Tags: python, regex, python-3.x, data-extraction

Unable to extract date of birth from a given format


I have a set of text files from which I have to extract the date of birth. The code below extracts it from most of the files, but fails when the date is given in the format shown below. How can I extract the DOB? The data is far from uniform.

Data:

data="""
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

Code:

import re    
pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4})).*',re.I)

matches=pattern.findall(data)

for match in matches:
    print(match)

expected output:

12/23/1955

Solution

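    import re

    data="""
    Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
    Jacob's Date of birth is 9/15/1963
    Name:Annie; DOB:10/30/1970
    """

    # The lazy .*? stops at the first date after the label, not the last one on the line.
    pattern = re.compile(r'.*?\b(?:DOB|Date of birth)\b.*?(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})', re.I)

    # With a single capture group, findall returns the text of group 1 for each match.
    matches = pattern.findall(data)

    for match in matches:
        print(match)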
    

    Output:

    12/23/1955
    9/15/1963
    10/30/1970
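
For context on why the pattern in the question picks up the wrong value on the first line, here is a minimal sketch comparing a greedy `.*` with a lazy `.*?` in front of the same date group. The variable names (`line`, `date`, `greedy`, `lazy`) are illustrative only and the sample line is a shortened copy of the first data line.

```python
import re

line = "Thomas, John - DOB/Sex:    12/23/1955      11/15/2014   11:53 AM"
date = r"(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})"

# Greedy: .* runs to the end of the line and backtracks, so the group
# lands on the LAST position where a date can still match.
greedy = re.search(r"DOB.*" + date, line, re.I)

# Lazy: .*? expands one character at a time, so the group lands on the
# FIRST date after the label.
lazy = re.search(r"DOB.*?" + date, line, re.I)

print(greedy.group(1))  # 1/15/2014
print(lazy.group(1))    # 12/23/1955
```

The greedy version even loses the leading digit of the second date, because `\d{1,2}` is satisfied by a single digit once `.*` has swallowed everything before it.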
    

    Explanation:

    .*?             : 0 or more of any character except newline, as few as possible (lazy)
    \b              : word boundary
    (?:             : start non capture group
      DOB           : literally
     |              : OR
      Date of birth : literally
    )               : end group
    \b              : word boundary
    .*?             : 0 or more of any character except newline, as few as possible (lazy)
    (               : start group 1
        \d{1,2}     : 1 or 2 digits
        [/-]        : slash or dash
        \d{1,2}     : 1 or 2 digits
        [/-]        : slash or dash
        (?:         : start non capture group
            \d\d    : 2 digits
        ){1,2}      : end group; may appear once or twice (i.e. 2 or 4 digits)
    )               : end capture group 1
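
The breakdown above can also live inside the pattern itself using `re.VERBOSE`. The sketch below does that and adds an optional validation step with `datetime.strptime`. It assumes the dates are in month/day/year order (as in the sample data) and only validates 4-digit years. The helper names `parse_dob` and `extract_dobs`, and the second sample line with an impossible date, are hypothetical and not part of the original answer.

```python
import re
from datetime import datetime

# Same pattern as the answer, minus the leading .*? (findall scans for the
# label on its own). In verbose mode unescaped whitespace is ignored, so
# the spaces in "Date of birth" have to be escaped.
DOB_PATTERN = re.compile(
    r"""
    \b(?:DOB|Date\ of\ birth)\b     # label
    .*?                             # lazily skip to the first date on the line
    (                               # group 1: the date
        \d{1,2} [/-] \d{1,2} [/-]   # two 1-2 digit fields, each followed by / or -
        (?:\d\d){1,2}               # 2- or 4-digit year
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_dob(raw):
    """Return a date for a month/day/4-digit-year string, else None.

    Assumption: month comes first. Two-digit years are rejected here
    because their century is ambiguous for birth dates.
    """
    try:
        return datetime.strptime(raw.replace("-", "/"), "%m/%d/%Y").date()
    except ValueError:
        return None


def extract_dobs(text):
    """Find every labelled date in text and validate each one."""
    return [parse_dob(raw) for raw in DOB_PATTERN.findall(text)]


# Hypothetical sample: the second line matches the regex but is not a real date.
sample = "Name:Annie; DOB:10/30/1970\nName:Bob; DOB:13/45/1999"
for dob in extract_dobs(sample):
    print(dob)
```

The regex only checks the shape of the value, so a string like `13/45/1999` still matches; the `strptime` step is what filters it out.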