I'm trying to write a regular expression that extracts the amount of time left for something, which can be given in days, months or years.
For example, take the sentence: "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."
I know this is a funny sentence, but the point is that I want to extract the durations it contains.
I have written the regular expression (?:\w*\-?\w*\s*\(\s*\d+\s*\w*\)\s*\w*|\b\d*\s+\w*\d*)\s*(?:year|month|day)s?
but it's not extracting "5 (five) days", i.e. durations where a digit is followed by words in parentheses. Can anyone help with the regex?
Thanks in advance
If you want to match the parts in the example data, you might use
\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b
The pattern matches:
\w+(?:-\w+)?
Match 1+ word chars, optionally followed by - and 1+ word chars
\s*
Match optional whitespace chars
(?:\(\w+(?:-\w+)?\)\s+)?
Optionally match from ( till ), where there can be word chars with an optional - and word chars in between, followed by 1+ whitespace chars
(?:year|month|day)s?
Match any of the alternatives and an optional s
\b
A word boundary to prevent a partial match
Example
import re
from pprint import pprint
regex = r"\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b"
s = "This product has a shell life of 21 days or 21 (twenty-one) days or twenty-one days or 21 months, 21 years or five days or 5 (five) days or five (5) days or five(5) days."
pprint(re.findall(regex, s))
Output
['21 days',
'21 (twenty-one) days',
'twenty-one days',
'21 months',
'21 years',
'five days',
'5 (five) days',
'five (5) days',
'five(5) days']
Note that \s can also match a newline, and \w matches letters, digits and the underscore, so the pattern can match a broad range of strings.
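To make that caveat concrete, here is a minimal sketch comparing the broad pattern (with the unit alternation written as year|month|day) against a tighter variant. The tighter variant is only an assumption about what you might want: the names WORD, NUM and strict are made up for illustration, the list of number words is deliberately incomplete, and [ \t] is used instead of \s so that a match cannot span a newline.

```python
import re

# Broad pattern: \w+ accepts any word, not only numbers.
broad = re.compile(r"\w+(?:-\w+)?\s*(?:\(\w+(?:-\w+)?\)\s+)?(?:year|month|day)s?\b")

# Hypothetical tighter variant: a quantity is either digits or a number word.
# WORD is an illustrative, incomplete list; extend it for real data.
WORD = r"(?:one|two|three|four|five|six|seven|eight|nine|ten|twenty)"
NUM = rf"(?:\d+|{WORD}(?:-{WORD})?)"
strict = re.compile(
    rf"\b{NUM}[ \t]*(?:\({NUM}\)[ \t]+)?(?:year|month|day)s?\b",
    re.IGNORECASE,
)

text = "Use within a few days, or 5 (five) days after opening."
print(broad.findall(text))   # ['few days', '5 (five) days']
print(strict.findall(text))  # ['5 (five) days']
```

The broad pattern picks up "few days" because \w+ matches any word, while the tighter one only accepts digits or the listed number words. The trade-off is that any number word missing from the list will not be matched.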