pandas, python-re

regular expression x.group()


Please explain, step by step, how the code below produces its result, including answers to the questions that follow. Thanks!

df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

  1. What does Series.str do? I can't inspect it.
  2. What is the x in x.groups(), and what does groups() do?
  3. Why the [0] in x.groups()[0][:3]?

Given the dataframe df below:

0   Monday: The doctor's appointment is at 2:45pm.
1   Tuesday: The dentist's appointment is at 11:30...
2   Wednesday: At 7:00pm, there is a basketball game!
3   Thursday: Be back home by 11:15 pm at the latest.
4   Friday: Take the train at 08:10 am, arrive at ...

This code transforms the above to:

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

Solution

  • To complement @AnuragDabas's comment, here is a breakdown of the processing using Python's re module:

    >>> import re
    >>> s = "Monday: The doctor's appointment is at 2:45pm."
    
    >>> re.search(r'(\w+day\b)', s) # find any word ending in "day"
    <re.Match object; span=(0, 6), match='Monday'>
    
    >>> re.search(r'(\w+day\b)', s).groups() # get the matching groups
    ('Monday',)
    
    >>> re.search(r'(\w+day\b)', s).groups()[0] # take the first element
    'Monday'
    
    >>> re.search(r'(\w+day\b)', s).groups()[0][:3] # get the first 3 characters
    'Mon'
    

    When used in the context of pandas.Series.str.replace, the lambda is passed to the re.sub function (as described in the documentation), and its return value is used as the replacement for each match (so "ABCDEFday" is replaced with "ABC").
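The same replacement can be sketched with re.sub alone, without pandas. This is a minimal illustration, assuming .str.replace applies the pattern to each string in the Series; the helper name shorten is hypothetical and not part of the original post.

```python
import re

def shorten(match):
    # match is the re.Match object for one occurrence of the pattern
    return match.groups()[0][:3]

pattern = r'(\w+day\b)'

# A named function and a lambda behave the same way as the replacement
result = re.sub(pattern, shorten, "Monday: The doctor's appointment is at 2:45pm.")
print(result)  # Mon: The doctor's appointment is at 2:45pm.

abc = re.sub(pattern, lambda x: x.groups()[0][:3], "ABCDEFday")
print(abc)  # ABC
```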

    Description of the second parameter of .str.replace, from the pandas documentation:

    repl: str or callable
    
        Replacement string or a callable. The callable is passed the regex match object and must return a replacement string to be used. See re.sub().
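To see what the callable actually receives, the sketch below records the match object's contents on each call. The names seen and inspect are made up for illustration. It also shows why [0] is needed: groups() returns a tuple of all capture groups, and the pattern has exactly one.

```python
import re

seen = []

def inspect(match):
    # Record the full match, the tuple of capture groups, and the span
    seen.append((match.group(0), match.groups(), match.span()))
    # match.group(1) is the first capture group, same as match.groups()[0]
    return match.group(1)[:3]

out = re.sub(r'(\w+day\b)', inspect, "Friday: Take the train at 08:10 am")
print(seen)  # [('Friday', ('Friday',), (0, 6))]
print(out)   # Fri: Take the train at 08:10 am
```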
    

    NB. The regex is flawed in that any word ending in "day" will be processed. Thus, if a line contained, for example, "Saturday: this is my birthday and not a workday!", the result would be "Sat: this is my bir and not a wor!".
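One possible way around this flaw is to list the weekday names explicitly. This is only a sketch, assuming the text contains only the seven English weekday names spelled with an initial capital; the same pattern string should also work with df['text'].str.replace(..., regex=True).

```python
import re

# Assumption: only the seven English weekday names should be shortened
weekday = re.compile(r'\b((?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)\b')

line = "Saturday: this is my birthday and not a workday!"
fixed = weekday.sub(lambda x: x.group(1)[:3], line)
print(fixed)  # Sat: this is my birthday and not a workday!
```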