Extracting a string with a regular expression that uses OR: which result is returned when several alternatives match?

I could use some explanation of how str.extract works with regex in Python.

For example, I have these strings:

6/18/1985 Primary Care Doctor
In 1980, the patient was living in Naples and de
2008 partial thyroidectomy
2/6/96 sleep studyPain Treatment Pain Level

I use the following code to extract the dates in the strings:

str.extract('((\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4}))')

This code works perfectly on my original strings and returns the dates I expect.

However, my question is: since 6/18/1985 technically also satisfies my second condition (\d{4}), which would return 1985, why does my code still work and return 6/18/1985?

I think my biggest confusion is about how the | (OR) operator works when more than one alternative can match, and which result is returned.

Any thoughts? Many thanks in advance

Solution

Consider this regex match:

>>> import re
>>> re.findall(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})|([P])', "6/18/1985 2234 Primary Care Doctor")
[('6/18/1985', '', ''), ('', '2234', ''), ('', '', 'P')]
    ^^^1st group^^^      ^^^2nd group^^^  ^^^3rd group^^^

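re.findall returns one tuple per match, and each tuple has one slot per capturing group: the slot of the alternative that actually matched holds the text, and the other slots are empty strings. The pattern has three alternatives separated by |, each in its own group. The regex engine scans the string from left to right and, at each position, tries the alternatives in the order they are written; the first one that succeeds at that position produces the match, and the scan resumes right after it. In "6/18/1985 2234 Primary Care Doctor", the first alternative matches 6/18/1985 at the very start, the second matches 2234, and the third matches the P of Primary, so each group is filled in exactly one of the three tuples. In other words, | does not return every alternative that could match somewhere: the match that starts earliest wins, and among alternatives that match at the same position, the one written first in the pattern wins. On the other hand, if we try the same pattern on this string: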

>>> re.findall(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})|([P])', "6/18/1985 Primary Care Doctor")
[('6/18/1985', '', ''), ('', '', 'P')]
   ^^^1st group^^^      ^^^3rd group^^^

This time no tuple has the second group (\d{4}) filled. The string does contain four digits in a row, 1985, but they were already consumed as part of the 6/18/1985 match found by the first alternative, and findall only looks for further matches after the end of the previous one. So only the first and third alternatives produce matches.
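To see how | chooses between alternatives, here is a minimal sketch using only re.search. The short \d{2}|\d{4} patterns are made up for illustration and are not from the question.

```python
import re

text = "6/18/1985 Primary Care Doctor"

# The scan starts at index 0, where the date alternative already succeeds,
# so the bare year is never reported as a separate first match.
m = re.search(r"(\d{1,2}/\d{1,2}/\d{2,4})|(\d{4})", text)
first_match = (m.group(0), m.start(), m.groups())

# Swapping the alternatives does not change the result here: \d{4} cannot
# match at index 0 ("6/18"), so the engine falls through to the date
# alternative before it ever advances to the "1985" at index 5.
swapped = re.search(r"(\d{4})|(\d{1,2}/\d{1,2}/\d{2,4})", text)
swapped_match = (swapped.group(0), swapped.start())

# The written order only matters when two alternatives can match
# at the same position.
short_first = re.search(r"\d{2}|\d{4}", "1985").group(0)
long_first = re.search(r"\d{4}|\d{2}", "1985").group(0)

print(first_match)    # ('6/18/1985', 0, ('6/18/1985', None))
print(swapped_match)  # ('6/18/1985', 0)
print(short_first, long_first)  # 19 1985
```

Note that re.search reports a group that did not take part in the match as None, while re.findall shows it as an empty string.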

In your case, str.extract keeps only the first match in each row and returns one column per capturing group. The group belonging to the alternative that matched holds the text, and the other group is NaN:

>>> import pandas as pd
>>> df = pd.Series(["6/18/1985 Primary Care Doctor", "In 1980, the patient was living in Naples and de"])
>>> df.str.extract(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})')
           0     1
0  6/18/1985   NaN
1        NaN  1980

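You can see NaN for the second group in the first row and for the first group in the second row. That is also why 6/18/1985 comes back instead of 1985: the full date matches at the start of the string, before the engine ever reaches the position where \d{4} alone could match.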
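If you would rather get a single column of dates than one column per alternative, one option is to make the inner groups non-capturing and keep only one outer capturing group. This is a minimal sketch, assuming the four sample strings from the question; the names s, pattern and dates are just illustrative.

```python
import pandas as pd

s = pd.Series([
    "6/18/1985 Primary Care Doctor",
    "In 1980, the patient was living in Naples and de",
    "2008 partial thyroidectomy",
    "2/6/96 sleep studyPain Treatment Pain Level",
])

# (?:...) groups do not capture, so only the outer group produces a column.
pattern = r"((?:\d{1,2}/\d{1,2}/\d{2,4})|(?:\d{4}))"

# With a single capture group, expand=False returns a Series
# holding the first match found in each row.
dates = s.str.extract(pattern, expand=False)

print(dates.tolist())  # ['6/18/1985', '1980', '2008', '2/6/96']
```

The alternation behaves exactly as before; only the shape of the output changes.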