I could use some explanation of how str.extract works with regex in Python.
For example, I have some strings:
6/18/1985 Primary Care Doctor
In 1980, the patient was living in Naples and de
2008 partial thyroidectomy
2/6/96 sleep studyPain Treatment Pain Level
I use the following code to extract the dates in the strings:
str.extract('((\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4}))')
This code works perfectly with my original strings and outputs:
6/18/1985
1980
2008
2/6/96
However, my question is: since 6/18/1985 technically matches my second condition (\d{4}), with a return value of 1985, why does my code still work and return 6/18/1985?
I think my biggest confusion comes from how the | (or) operator works when more than one alternative could match, and which match is returned.
Any thoughts? Many thanks in advance.
Consider this regex match:
>>> import re
>>> re.findall(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})|([P])', "6/18/1985 2234 Primary Care Doctor")
[('6/18/1985', '', ''), ('', '2234', ''), ('', '', 'P')]
 ^^^1st group^^^        ^^^2nd group^^^   ^^^3rd group^^^
As the output above shows, the pattern has three capturing groups joined by | (alternation). The regex engine scans the target string from left to right. At each position it tries the alternatives in the order they are written and takes the first one that matches there; the matched text is consumed and the search resumes right after it. re.findall returns one tuple per match, with the matched text in the slot of the group that took part and empty strings for the other groups. In the string "6/18/1985 2234 Primary Care Doctor", each of the three alternatives matches somewhere, so each group is non-empty in exactly one tuple. On the other hand, if we try the same pattern on this string:
>>> re.findall(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})|([P])', "6/18/1985 Primary Care Doctor")
[('6/18/1985', '', ''), ('', '', 'P')]
 ^^^1st group^^^        ^^^3rd group^^^
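We can see that we didn't get any match for the second pattern (\d{4}). The string does contain four consecutive digits (1985), but they were already consumed as part of the first pattern's match, 6/18/1985, which starts earlier in the string, and matches never overlap. So only the 1st and 3rd patterns produce matches here. This is also why your code returns 6/18/1985 rather than 1985: the match that starts furthest to the left wins.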
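To make the "leftmost match wins" rule concrete, here is a small sketch using only the standard re module. The variable names (date_first, year_first, text) are made up for this illustration. It suggests that swapping the order of the alternatives does not change the result for your string, because \d{4} simply cannot match at the position where the date starts; the order only decides the outcome when two alternatives can match at the same position.

```python
import re

# Two hypothetical orderings of the same alternatives.
date_first = r'(\d{1,2}/\d{1,2}/\d{2,4})|(\d{4})'
year_first = r'(\d{4})|(\d{1,2}/\d{1,2}/\d{2,4})'
text = "6/18/1985 Primary Care Doctor"

m1 = re.search(date_first, text)
m2 = re.search(year_first, text)

# At index 0, \d{4} fails on "6/18", so the date alternative matches
# in both orderings and consumes "1985" along the way.
print(m1.group(0), m1.groups())
print(m2.group(0), m2.groups())

# Order only matters when both alternatives can match at the same position:
# here \d{2} is tried first and wins, even though \d{4} would also match.
m3 = re.search(r'(\d{2})|(\d{4})', "1985")
print(m3.group(0), m3.groups())
```

In the first two calls the full date is matched; in the last one only the first two digits are, since the earlier alternative is taken as soon as it succeeds.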
In your case, str.extract keeps only the first match in each row and returns one column per capturing group, as shown below:
>>> import pandas as pd
>>> df = pd.Series(["6/18/1985 Primary Care Doctor", "In 1980, the patient was living in Naples and de"])
>>> df.str.extract(r'(\d{1,2}[/]\d{1,2}[/]\d{2,4})|(\d{4})')
           0     1
0  6/18/1985   NaN
1        NaN  1980
You can see that there are NaN values for the second pattern in the first string and for the first pattern in the second string, because only one alternative takes part in any given match.
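As a rough illustration of the per-row behaviour, the sketch below emulates it with plain re.search on the four strings from the question. extract_first is a hypothetical helper written for this answer, not a pandas function, and it returns None where pandas would show NaN.

```python
import re

PATTERN = re.compile(r'(\d{1,2}/\d{1,2}/\d{2,4})|(\d{4})')

def extract_first(text):
    """Hypothetical helper: capture groups of the first match in text.

    Returns one entry per capturing group; groups that did not take
    part in the match (or every group, if nothing matched) are None.
    """
    m = PATTERN.search(text)
    if m is None:
        return (None,) * PATTERN.groups
    return m.groups()

rows = [
    "6/18/1985 Primary Care Doctor",
    "In 1980, the patient was living in Naples and de",
    "2008 partial thyroidectomy",
    "2/6/96 sleep studyPain Treatment Pain Level",
]

results = [extract_first(row) for row in rows]
for row, groups in zip(rows, results):
    print(groups, "<-", row)
```

Each row fills exactly one of the two slots. If you would rather get a single value per row, wrapping the whole alternation in one outer group, as the pattern in the question does, puts the complete match in the first column.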