
Failing to match number ranges with a pattern declared in a DEFINE block using the PyPI regex package


I'm using https://github.com/mrabarnett/mrab-regex (via pip install regex), but I'm getting an unexpected result here:

import regex

pattern_string =  r'''
        (?&N)
        ^ \W*? ENTRY              \W* (?P<entries>    (?&Range)    )     (?&N)

        (?(DEFINE)
             (?P<Decimal>
                 [ ]*? \d+ (?:[.,] \d+)? [ ]*?
             )
             (?P<Range>
                 (?&Decimal) - (?&Decimal) | (?&Decimal)
                 #(?&d) (?: - (?&d))?
             )
             (?P<N>
                 [\s\S]*?
             )
        )
    '''

flags = regex.MULTILINE | regex.VERBOSE  #| regex.DOTALL  | regex.V1 #| regex.IGNORECASE | regex.UNICODE

pattern = regex.compile(pattern_string, flags=flags)

bk2 = '''
ENTRY: 0.0975 - 0.101
'''.strip()
match = pattern.match(bk2)
match.groupdict()

gives:

{'entries': '0.0975', 'Decimal': None, 'Range': None, 'N': None}

It misses the second value.

> pip show regex
Name: regex
Version: 2022.1.18
Summary: Alternative regular expression module, to replace re.
Home-page: https://github.com/mrabarnett/mrab-regex
Author: Matthew Barnett
Author-email: [email protected]
License: Apache Software License
Location: ...
Requires:
Required-by:

> python --version
Python 3.10.0

Solution

  • The problem is that the optional spaces are part of the Decimal group pattern, and calls to groups defined in the DEFINE block are treated as atomic. The trailing [ ]*? is lazy and first matches zero spaces, but once (?&Decimal) has matched, the engine cannot backtrack into it to consume the space before the hyphen. The first Range alternative therefore fails, and the second alternative, a lone (?&Decimal), matches only 0.0975.

    You can check this by putting the Decimal pattern into an atomic group and comparing two patterns. This one:

    (?mx)^\W*?ENTRY\W*(?P<entries>(?>[ ]*? \d+ (?:[.,] \d+)? [ ]*?) - (?>[ ]*? \d+ (?:[.,] \d+)? [ ]*?) | (?>[ ]*? \d+ (?:[.,] \d+)? [ ]*?))

    exhibits the same behavior as your regex with the DEFINE block, while this one:

    (?mx)^\W*?ENTRY\W*(?P<entries>[ ]*? \d+ (?:[.,] \d+)? [ ]*? - [ ]*? \d+ (?:[.,] \d+)? [ ]*? | [ ]*? \d+ (?:[.,] \d+)? [ ]*?)

    finds the match correctly.
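    To make the comparison concrete, here is a minimal sketch that builds both variants from the same Decimal fragment and runs them on the sample input. The helper build and the names DECIMAL, plain, and atomic are introduced here for illustration only. The fallback to the standard re module is an assumption: it needs Python 3.11 or later, where re gained atomic groups.

```python
try:
    import regex as rx  # third-party module used in the question
except ImportError:
    import re as rx  # assumption: Python 3.11+, which supports (?>...)

# The Decimal fragment from the question, including its optional spaces.
DECIMAL = r"[ ]*? \d+ (?:[.,] \d+)? [ ]*?"


def build(decimal):
    """Compile the ENTRY pattern around the given Decimal fragment (hypothetical helper)."""
    return rx.compile(
        rf"^\W*?ENTRY\W*(?P<entries>{decimal} - {decimal} | {decimal})",
        rx.MULTILINE | rx.VERBOSE,
    )


plain = build(DECIMAL)             # backtracking into Decimal is allowed
atomic = build(f"(?>{DECIMAL})")   # Decimal wrapped in an atomic group

text = "ENTRY: 0.0975 - 0.101"
plain_entries = plain.search(text).group("entries")
atomic_entries = atomic.search(text).group("entries")

print(repr(plain_entries))   # '0.0975 - 0.101'
print(repr(atomic_entries))  # '0.0975'
```

    The atomic variant stops after the first number because the space before the hyphen can no longer be claimed by the trailing [ ]*? once the group has been left.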

    The easiest fix is to move the optional space patterns into the Range group pattern.

    There are other minor enhancements you might want to introduce here:

    • As you are only interested in the captured substring, you do not need regex.match with the N group pattern ([\s\S]*?); you can use regex.search and remove the N pattern from the regex.
    • You do not need alternation for patterns like a-b|a; an optional non-capturing group, a(?:-b)?, is more efficient.
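    The second point can be checked with a small sketch showing that the alternation a-b|a and the optional-group form a(?:-b)? accept the same inputs. It uses the standard re module, since nothing here depends on DEFINE. The names DECIMAL, alternation, optional, and samples are illustrative, not part of the original code.

```python
import re

# A decimal number with an optional fractional part (dot or comma).
DECIMAL = r"\d+(?:[.,]\d+)?"

# Alternation form: the range alternative has to come first, otherwise the
# single-number alternative would succeed and the range would never be tried.
alternation = re.compile(rf"{DECIMAL} *- *{DECIMAL}|{DECIMAL}")

# Optional-group form: the first number is matched once, and the
# "- second number" tail is simply optional.
optional = re.compile(rf"{DECIMAL}(?: *- *{DECIMAL})?")

samples = ["0.0975 - 0.101", "0.0975", "1,5-2"]
results = {
    s: (alternation.match(s).group(), optional.match(s).group())
    for s in samples
}

for s, (alt, opt) in results.items():
    print(f"{s!r}: alternation={alt!r}, optional={opt!r}")
```

    With the optional group, the engine does not have to re-match the first number when the input contains no range.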

    So, the regex can look like

    ^ \W* ENTRY              \W* (?P<entries>    (?&Range)    ) 
    (?(DEFINE)
        (?P<Decimal>
            \d+ (?:[.,] \d+)?
        )
        (?P<Range>
            (?&Decimal)(?:\ *-\ *(?&Decimal))?
        )
    )
    

    See the Python demo:

    import regex
    pattern_string =  r'''
            ^ \W* ENTRY              \W* (?P<entries>    (?&Range)    )
    
            (?(DEFINE)
                 (?P<Decimal>
                     \d+ (?:[.,] \d+)?
                 )
                 (?P<Range>
                     (?&Decimal)(?:\ *-\ *(?&Decimal))?
                 )
            )
        '''
    
    flags = regex.MULTILINE | regex.VERBOSE
    pattern = regex.compile(pattern_string, flags=flags)
    
    bk2 = '''
    ENTRY: 0.0975 - 0.101
    '''.strip()
    
    match = pattern.search(bk2)
    
    print(match.groupdict())
    

    Output:

    {'entries': '0.0975 - 0.101', 'Decimal': None, 'Range': None}