When a regular expression using OR (|) has multiple possible matches, which result does str.extract return?


I could use some explanation of how str.extract works with regex in Python.

For example, I have these strings:

6/18/1985 Primary Care Doctor
In 1980, the patient was living in Naples and de
2008 partial thyroidectomy
2/6/96 sleep studyPain Treatment Pain Level

I use the following code to extract the dates in the strings:

str.extract('((\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4}))')

This code works perfectly on my original strings and outputs:

6/18/1985
1980
2008
2/6/96

However, 6/18/1985 also technically matches my second alternative (\d{4}), which would return 1985. Why does my code still return 6/18/1985?

My biggest confusion is how the | (OR) operator behaves when more than one alternative can match: which result is returned?

Any thoughts? Many thanks in advance


Solution

  • Consider this regex match:

    >>> import re
    >>> re.findall(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})|([P])', "6/18/1985 2234 Primary Care Doctor")
    [('6/18/1985', '', ''), ('', '2234', ''), ('', '', 'P')]
        ^^^1st group^^^      ^^^2nd group^^^  ^^^3rd group^^^
    

    The pattern has three capturing groups joined by |, so each tuple returned by re.findall has three slots, one per group. The regex engine scans the string from left to right. At each position it tries the alternatives in the order they are written, and the first alternative that succeeds at that position produces the match; the slots for the other groups are left empty. Scanning then resumes after the end of that match, so matches never overlap. Here, 6/18/1985 is matched by the 1st group at position 0, 2234 by the 2nd group, and the P of Primary by the 3rd group. The 1985 inside the date is never reported by the 2nd group, because those characters were already consumed by the earlier match. Compare the result when the standalone 2234 is removed from the string:

    >>> re.findall(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})|([P])', "6/18/1985 Primary Care Doctor")
    [('6/18/1985', '', ''), ('', '', 'P')]
       ^^^1st group^^^      ^^^3rd group^^^
    

    This time there is no tuple for the second pattern (\d{4}). The string still contains four consecutive digits, 1985, but they belong to the date that the 1st pattern matched starting at position 0, and characters that have been consumed are not matched again. Only the 1st and 3rd patterns produce matches.
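The leftmost-match rule described above can be checked with a small sketch using only the standard re module. The variable names (text, date_first, year_first) are illustrative and not from the question, and the [/] from the original pattern is written as a plain / here. The example suggests that the starting position is decided first, and the order of the alternatives only matters when more than one of them can match at that same position.

```python
import re

text = "6/18/1985 Primary Care Doctor"

# Date alternative written first, as in the question.
date_first = re.search(r"\d{1,2}/\d{1,2}/\d{2,4}|\d{4}", text)

# Year alternative written first. At position 0, \d{4} fails on "6/18",
# so the engine falls through to the date alternative at the same position.
year_first = re.search(r"\d{4}|\d{1,2}/\d{1,2}/\d{2,4}", text)

print(date_first.group(), date_first.start())  # 6/18/1985 0
print(year_first.group(), year_first.start())  # 6/18/1985 0

# Order decides the winner only when both alternatives can match
# at the same starting position.
two_first = re.search(r"\d{2}|\d{4}", "1985").group()
four_first = re.search(r"\d{4}|\d{2}", "1985").group()
print(two_first)   # 19
print(four_first)  # 1985
```

In both searches the date wins because it starts at position 0, while 1985 on its own would only start at position 5.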

    In your case, str.extract returns only the first match in each row, with one column per capturing group. The column for the alternative that matched holds the value and the other column is NaN:

    >>> import pandas as pd
    >>> df = pd.Series(["6/18/1985 Primary Care Doctor", "In 1980, the patient was living in Naples and de"])
    >>> df.str.extract(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})')
               0     1
    0  6/18/1985   NaN
    1        NaN  1980
    

    The first row has NaN for the second pattern, and the second row has NaN for the first pattern. In the first row the date is returned rather than 1985 because the date match starts earlier in the string, at position 0.
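If a single column of dates is what you want, one option is to wrap the whole alternation in one capturing group. The following is a minimal sketch, assuming pandas is installed; the names s and dates are illustrative, and the four strings are the ones from the question. With exactly one capturing group and expand=False, str.extract returns a Series instead of a DataFrame.

```python
import pandas as pd

s = pd.Series([
    "6/18/1985 Primary Care Doctor",
    "In 1980, the patient was living in Naples and de",
    "2008 partial thyroidectomy",
    "2/6/96 sleep studyPain Treatment Pain Level",
])

# One capturing group around the whole alternation: the first match
# in each row lands in a single result, whichever alternative produced it.
dates = s.str.extract(r"(\d{1,2}/\d{1,2}/\d{2,4}|\d{4})", expand=False)

print(dates.tolist())  # ['6/18/1985', '1980', '2008', '2/6/96']
```

The pattern in the question, with an outer group around two inner groups, behaves the same way for its first column; the two inner groups simply add extra columns showing which alternative matched.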