
pyparsing: Trailing whitespace is not matched but still included for endloc calculation


I am using Python 3.9.5 and pyparsing==3.0.6

I have a scenario where I need to match a word (alphanumeric) and, optionally, a second word. However, pp.Optional() combined with trailing whitespace leads to an incorrect endloc index.

import pyparsing as pp


def first_match(expression, text):
    for m, s, e in expression.scanString(text, maxMatches=1):
        return {'match': m, 'start': s, 'end': e}
    return None


expr = pp.Group(
    pp.Word(pp.alphanums) +
    pp.Optional(
        pp.Word(pp.alphanums)
    )
)

In the test below, everything is as I expect: the entire string is matched, from 0 to 3.

print(first_match(expr, "one"))
# {'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 3}

If I add trailing spaces, even though they are (correctly) not matched, they are included in the endloc index calculation. So the matched range is 0 to 8:

print(first_match(expr, "one     "))
# {'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 8}

Shouldn't the endloc index returned by scanString always point just past the last matched character?


Solution

  • This is intended behavior, and is not a regression from pyparsing 2.4.7.

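    Think of Optional(expr) as "expr or empty". Before trying the optional expression, pyparsing advances over any intervening whitespace. When it then fails to find the expression, it still treats the parse as a successful (empty) match, since the expression was not required. The end location stays where the parser had already advanced to, which is after the whitespace.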
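
The "expr or empty" reasoning can be sketched with a toy model. This is not pyparsing's actual implementation; the helper names (`skip_ws`, `match_word`, `match_optional_word`, `parse`) are hypothetical and only mimic the order of operations described above, using the standard library.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")  # stand-in for pp.Word(pp.alphanums)


def skip_ws(text, loc):
    """Advance loc past whitespace, as a parser does before trying an element."""
    while loc < len(text) and text[loc].isspace():
        loc += 1
    return loc


def match_word(text, loc):
    """Required word: skip whitespace, then match a word or return None."""
    loc = skip_ws(text, loc)
    m = WORD.match(text, loc)
    return (m.end(), m.group()) if m else None


def match_optional_word(text, loc):
    """Optional word ("word or empty"): never fails."""
    loc = skip_ws(text, loc)
    m = WORD.match(text, loc)
    if m:
        return m.end(), [m.group()]
    # Empty match: the whitespace skipped above is not undone.
    return loc, []


def parse(text):
    """Toy version of Word + Optional(Word); returns (tokens, end)."""
    first = match_word(text, 0)
    if first is None:
        return None
    loc, token = first
    loc, rest = match_optional_word(text, loc)
    return [token] + rest, loc


print(parse("one"))          # (['one'], 3)
print(parse("one     "))     # (['one'], 8)
print(parse("one     two"))  # (['one', 'two'], 11)
```

In this model the second call ends at 8 for the same reason as in the question: the optional branch succeeds with no tokens, but only after the position has moved past the spaces.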

    To get the behavior you are looking for, you could write the expression as an explicit alternation:

    expr = pp.Group(
        pp.Word(pp.alphanums) + pp.Word(pp.alphanums)
        | pp.Word(pp.alphanums)
    )
    

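    If a second alphanumeric word is present, this parses both words; if not, it parses just the first one. When the first alternative fails, the whitespace it skipped is discarded along with it, so the end location stops right after the last matched word.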

    Test with and without a second value:

    print(first_match(expr, "one"))
    print(first_match(expr, "one     "))
    print(first_match(expr, "one     two"))
    print(first_match(expr, "one     two    "))
    

    Gives:

    {'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 3}
    {'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 3}
    {'match': ParseResults([ParseResults(['one', 'two'], {})], {}), 'start': 0, 'end': 11}
    {'match': ParseResults([ParseResults(['one', 'two'], {})], {}), 'start': 0, 'end': 11}
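
If keeping the Optional form is preferable for other reasons, another possibility is to trim the reported end location after the fact. The following is a minimal sketch, assuming a hypothetical helper `trim_end` that is not part of pyparsing; it only post-processes the start and end values that scanString reports.

```python
def trim_end(text, start, end):
    """Move end left past any trailing whitespace in text[start:end]."""
    return start + len(text[start:end].rstrip())


# Values reported in the question for the Optional-based expression.
text = "one     "
start, end = 0, 8

print(trim_end(text, start, end))  # 3
```

Inside first_match, this would mean returning trim_end(text, s, e) as the 'end' value instead of e. Note that it strips whitespace only at the end of the matched span, so whitespace between the two words is unaffected.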