Regular expression for extracting no. of days, months and years left


I'm trying to write a regular expression that extracts the amount of time left for something, which can be given in days, months or years.

For example, consider the sentence: "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."

I know this is a funny sentence, but the point is that I want to extract every duration that appears in it.

I have written the regular expression (?:\w*\-?\w*\s*\(\s*\d+\s*\w*\)\s*\w*|\b\d*\s+\w*\d*)\s*(?:year|month|day)s? but it's not extracting 5 (five) days, or any duration that has a digit followed by a word in parentheses. Can anyone help with the regex?

Thanks in advance


Solution

  • If you want to match the parts in the example data, you might use

    \w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b
    

    The pattern matches:

    • \w+(?:-\w+)? Match 1+ word chars, optionally followed by - and 1+ word chars
    • \s* Match optional whitespace chars
    • (?:\(\w+(?:-\w+)?\)\s+)? Optionally match from ( till ), with 1+ word chars and an optional - and word chars in between, followed by 1+ whitespace chars
    • (?:year|month|day)s? Match any of the alternatives and an optional s
    • \b A word boundary to prevent a partial match

    Example

    import re
    from pprint import pprint
    
    # Number or word, optional "(...)" part, then the unit with an optional plural "s"
    regex = r"\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b"
    
    s = "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."
    
    # The pattern has no capture groups, so findall returns the full matches
    pprint(re.findall(regex, s))
    

    Output

    ['21 days',
     '21 (twenty-one) days',
     'twenty-one days',
     '21 months',
     '21 years',
     'five days',
     '5 (five) days',
     'five (5) days',
     'five(5) days']
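
If the number and the unit are needed separately rather than as one string, named groups are one option. The following is only a sketch built on the same pattern: the group names amount, alt and unit, and the helper extract_durations, are hypothetical names chosen for illustration.

```python
import re

# Same structure as the pattern above, with named groups added:
#   amount - the leading number or word (e.g. "21", "twenty-one")
#   alt    - the optional parenthesised form (e.g. "five" in "5 (five)")
#   unit   - the unit without the plural "s"
DURATION = re.compile(
    r"(?P<amount>\w+(?:-\w+)?)\s*"
    r"(?:\((?P<alt>\w+(?:-\w+)?)\)\s+)?"
    r"(?P<unit>year|month|day)s?\b"
)


def extract_durations(text):
    """Return (amount, alt, unit) tuples; alt is None without parentheses."""
    return [
        (m.group("amount"), m.group("alt"), m.group("unit"))
        for m in DURATION.finditer(text)
    ]


sample = "valid for 21 days or 5 (five) months or twenty-one years"
print(extract_durations(sample))
```

Each tuple keeps the unit in its singular form, which can make later conversion (for example to a number of days) easier to write.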
    

    Note that \s can also match a newline, and \w matches letters, digits and the underscore, so the pattern accepts a broad range of text (for example, it would also match "some days").
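
If that breadth is a problem, one possible tightening is to accept only digits or a fixed list of number words, and to use a whitespace class that excludes newlines. This is a sketch under assumptions, not part of the answer above: the NUMBER_WORDS list is a hypothetical, deliberately short vocabulary that would need extending for real data, and strict_durations is an illustrative helper name.

```python
import re

# Assumption: only these number words matter; extend the list as needed.
NUMBER_WORDS = ["one", "two", "three", "four", "five", "twenty", "twenty-one"]

# Longest alternatives first so "twenty-one" is tried before "twenty".
words = "|".join(sorted(NUMBER_WORDS, key=len, reverse=True))
NUMBER = rf"(?:\d+|{words})"

# Any whitespace character except a newline.
SPACE = r"[^\S\n]"

STRICT = re.compile(
    rf"\b{NUMBER}{SPACE}*(?:\({NUMBER}\){SPACE}+)?(?:year|month|day)s?\b",
    re.IGNORECASE,
)


def strict_durations(text):
    """Return the durations whose amount is a digit string or a known number word."""
    return STRICT.findall(text)


text = "keep for 21 days\nor 5 (five) days, not some days or 3\ndays"
print(strict_durations(text))
```

With this variant, "some days" is skipped because "some" is not a known number word, and "3" followed by a line break and "days" is skipped because the whitespace class does not cross lines.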