I'm looking for phrases like "one year", "two years", "2-3 years" or "3 - 4 years" in a long string. I tried to do it with regular expressions, but I'm not sure I understand what happens when groups are involved.
Here is what I mean:
import re
text = 'one year, honey 2-5 year, dressed six, ten'
pattern = (r'(one|two|three|four|five|six|seven|eight|nine|ten|'
           r'eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|'
           r'eighteen|nineteen|twenty|[0-9]+[- ]*[0-9]*)[+ ]*year?')
re.findall(pattern, text) # ['one', '2-5']
My problem is that I want ['one year', '2-5 year'] and I'm not sure how to get it. If I leave out the numbers written as words:
pattern = r'[0-9]+[- ]*[0-9]*[\+ ]*year?'
re.findall(pattern, text) # ['2-5 year']
Why do I get "year" in the second result but not in the first? How can I modify the pattern to get "year" in the first one as well?
Thanks in advance,
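When a pattern contains a capturing group, re.findall returns only the text captured by that group, which is why "year" is missing from your first result. Make the group non-capturing and put the numeric alternative first. Here is an example: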
>>> pattern = r'''(?x)\b(?:[0-9]+(?:[- ]*[0-9]+)?|one|two|three|four|five|six|seven|eight|nine|ten
... |eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty)
... [+ ]*years?\b'''
>>> re.findall(pattern, text)
['one year', '2-5 year']
See the Python demo and the regex demo.
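To see why the original pattern returned only 'one' and '2-5', here is a minimal sketch that compares a capturing and a non-capturing group. The shortened patterns and the names captured and whole are mine, for illustration only, not part of the question:

```python
import re

text = 'one year, honey 2-5 year, dressed six, ten'

# One capturing group: findall returns only what the group captured.
captured = re.findall(r'\b(one|[0-9]+-[0-9]+) years?\b', text)

# Same pattern with a non-capturing group: findall returns the whole match.
whole = re.findall(r'\b(?:one|[0-9]+-[0-9]+) years?\b', text)

print(captured)  # ['one', '2-5']
print(whole)     # ['one year', '2-5 year']
```

If you need both the whole match and the number, re.finditer with m.group(0) and m.group(1) should work as well.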
Details
(?x) - re.X / re.VERBOSE inline modifier
\b - a word boundary
(?: - start of a non-capturing group
[0-9]+(?:[- ]*[0-9]+)? - one or more digits, optionally followed by zero or more spaces or - and then one or more digits
|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty - or one of the words in the alternation list
) - end of the non-capturing group
[+ ]* - zero or more + or spaces
years? - year or years
\b - a word boundary.
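The breakdown above can also be written straight into the pattern as comments, since verbose mode ignores whitespace outside character classes. This is only a sketch: the name YEARS_RE and the sample string are made up here, and the sample uses the phrases from the question:

```python
import re

# Hypothetical name; same pattern as above, one commented piece per line.
YEARS_RE = re.compile(r'''
    \b                                  # word boundary
    (?:                                 # non-capturing group
        [0-9]+ (?: [- ]* [0-9]+ )?      # 2, 2-3, 3 - 4
      | one | two | three | four | five | six | seven | eight | nine | ten
      | eleven | twelve | thirteen | fourteen | fifteen | sixteen
      | seventeen | eighteen | nineteen | twenty
    )
    [+ ]*                               # optional "+" signs or spaces
    years?                              # "year" or "years"
    \b                                  # word boundary
''', re.VERBOSE)

sample = 'two years, 2-3 years or 3 - 4 years, honey'
matches = YEARS_RE.findall(sample)
print(matches)  # ['two years', '2-3 years', '3 - 4 years']
```

The spaces inside [- ] and [+ ] are kept because verbose mode does not strip whitespace inside a character class.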