regex, python-3.x, findall

Match multiple OR conditions in Python 3 regex findall


In Python 3:

This is the Office of Foreign Assets Control (OFAC) list of individuals whose assets should be monitored:

https://www.treasury.gov/ofac/downloads/sdn.csv

Many of the dates of birth (the very last column of the comma-delimited file) look like

DOB 23 Jun 1959; alt. DOB 23 Jun 1958

or

DOB 1959; alt. DOB 1958

I am trying to capture all the birthdates after the keywords "DOB" and "alt. DOB" with the following code:

    if len(x.split(';')) > 0:
        if len(re.findall('DOB (.*)', x.split(';')[0])) > 0:
            new = re.findall('DOB | alt. DOB (.*)', x.split(';')[0])[0]
            print(new)

            try:
                print(datetime.strptime(new, '%d %b %Y'))
                return datetime.strptime(new, '%d %b %Y')
            except:
                return None

But the code only gets the birthdate right after "DOB" and does not include the date of birth after "alt. DOB". How can I capture both? Thank you.


Solution

  • You could match DOB and use a capturing group for the date part. In the date part, the day and the month are optional, followed by a mandatory 4-digit year.

    The date pattern does not validate the date itself; it only makes the match a bit more specific.

    \bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b
    

    Explanation

    • \bDOB Match DOB and a space literally, preceded by a word boundary
    • ( Capture group 1
      • (?: Non capture group
        • (?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ Match a day number 1-31 (leading zero optional), a space, 1+ chars A-Za-z and a trailing space
      • )? Close group and make it optional
      • \d{4} Match 4 digits
    • )\b Close group 1 followed by a word boundary
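
As a side note to the explanation above, the pattern from the question likely misbehaves because | has the lowest precedence in a regex: 'DOB | alt. DOB (.*)' reads as "DOB " OR " alt. DOB (.*)", so the capture group stays empty whenever the first branch matches. The question's code also keeps only x.split(';')[0], which drops the "alt. DOB" segment before the regex sees it. A minimal sketch of both effects; the variable names are illustrative and not part of the original code:

```python
import re

line = "DOB 23 Jun 1959; alt. DOB 23 Jun 1958"

# Pattern from the question: two branches, and the group exists only in the
# second one. When the first branch ("DOB ") matches, findall reports ''.
question_pattern = r"DOB | alt. DOB (.*)"

# Applied to the first ';' segment only, as in the question's code.
first_segment = line.split(";")[0]
segment_result = re.findall(question_pattern, first_segment)
print(segment_result)   # → ['']

# Applied to the whole line: the "DOB" date is lost, only the alt. date is captured.
question_result = re.findall(question_pattern, line)
print(question_result)  # → ['', '23 Jun 1958']

# Pattern from this answer: a single branch, so every match fills group 1.
answer_pattern = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"
answer_result = re.findall(answer_pattern, line)
print(answer_result)    # → ['23 Jun 1959', '23 Jun 1958']
```

Because "alt. DOB" itself contains "DOB" after a word boundary, a single \bDOB branch covers both keywords and no alternation is needed.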

    For example:

    import re
    
    regex = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"
    test_str = ("DOB 23 Jun 1959; alt. DOB 23 Jun 1958\n"
        "DOB 1959; alt. DOB 1958")
    
    print(re.findall(regex, test_str))
    

    Output

    ['23 Jun 1959', '23 Jun 1958', '1959', '1958']
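
Since the question goes on to convert the captured text with datetime.strptime, the extracted strings could be parsed as sketched below. This is only one possible approach: parse_dob and extract_dobs are hypothetical helper names, the year-only fallback format '%Y' is an assumption about how such values should be handled, and '%b' assumes English month abbreviations (the default C locale).

```python
import re
from datetime import datetime

DOB_PATTERN = re.compile(
    r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"
)

def parse_dob(text):
    """Hypothetical helper: try the full date first, then a bare year.

    Returns None when neither format fits, e.g. for an impossible date
    that the regex lets through because it does not validate dates.
    """
    for fmt in ("%d %b %Y", "%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def extract_dobs(field):
    """Hypothetical helper: parse every DOB and alt. DOB value in one field."""
    return [parse_dob(match) for match in DOB_PATTERN.findall(field)]

# A year-only value becomes 1 January of that year; "31 Feb" is rejected.
dates = extract_dobs("DOB 23 Jun 1959; alt. DOB 1958; alt. DOB 31 Feb 1960")
print(dates)
# → [datetime.datetime(1959, 6, 23, 0, 0), datetime.datetime(1958, 1, 1, 0, 0), None]
```

Catching ValueError specifically, rather than using a bare except as in the question, keeps unrelated errors visible.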