
Understanding groups and "or" (alternation) in regular expressions


I'm looking for phrases like "one year", "two years", "2-3 years" or "3 - 4 years" in a long string. I've tried using regular expressions, but I'm not sure I understand how they behave when groups are involved.

Here is what I mean:

import re

text = 'one year, honey 2-5 year, dressed six, ten'
pattern = (r'(one|two|three|four|five|six|seven|eight|nine|ten|'
           r'eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|'
           r'eighteen|nineteen|twenty|[0-9]+[- ]*[0-9]*)[+ ]*year?')

re.findall(pattern, text)  # ['one', '2-5']

My problem is that I want ['one year', '2-5 year']. I'm not sure how to do it. If I leave out the numbers written as words:

pattern = r'[0-9]+[- ]*[0-9]*[\+ ]*year?'
re.findall(pattern, text)  # ['2-5 year']

Why do I get "year" in the second result but not in the first? How can I modify the first pattern so that it returns "year" as well?

Thanks in advance,


Solution

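  • re.findall returns only the captured text when the pattern contains a capturing group, which is why your first pattern gives ['one', '2-5']. Use a non-capturing group, (?:...), and list the numeric alternative first. Here is an example: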

    >>> pattern = r'''(?x)\b(?:[0-9]+(?:[- ]*[0-9]+)?|one|two|three|four|five|six|seven|eight|nine|ten
    ... |eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty)
    ... [+ ]*years?\b'''
    >>> re.findall(pattern, text)
    ['one year', '2-5 year']
    


    Details

    • (?x) - re.X / re.VERBOSE inline modifier
    • \b - a word boundary
    • (?: - start of a non-capturing group
      • [0-9]+(?:[- ]*[0-9]+)? - one or more digits, optionally followed by zero or more spaces or hyphens and then one or more digits
      • |one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty - one of the words in the alternation list
    • ) - end of the non-capturing group
    • [+ ]* - zero or more + or spaces
    • years? - year or years
    • \b - a word boundary.
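
To see why the group type matters, here is a minimal sketch comparing a capturing group with a non-capturing one. The word list is shortened for illustration, and the variable names (words, number, capturing, non_capturing, full_matches) are my own, not part of the question. It also shows re.finditer as one possible alternative if you would rather keep the capturing group.

```python
import re

text = 'one year, honey 2-5 year, dressed six, ten'

# Shortened word list for illustration; the full list behaves the same way.
words = r'one|two|three'
number = r'[0-9]+(?:[- ]*[0-9]+)?'

# One capturing group: re.findall returns only what the group matched.
capturing = re.findall(rf'\b({number}|{words})[+ ]*years?\b', text)

# Non-capturing group: re.findall returns the whole match.
non_capturing = re.findall(rf'\b(?:{number}|{words})[+ ]*years?\b', text)

# Alternative: keep the capturing group and read the full match via group(0).
full_matches = [m.group(0)
                for m in re.finditer(rf'\b({number}|{words})[+ ]*years?\b', text)]

print(capturing)      # ['one', '2-5']
print(non_capturing)  # ['one year', '2-5 year']
print(full_matches)   # ['one year', '2-5 year']
```

The finditer variant can be handy when you need both the number (group 1) and the full phrase (group 0) from the same match.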