I am using Python 3.9.5 and pyparsing==3.0.6
I have a scenario where I need to match a word (alphanumeric) and, optionally, a second word. However, pp.Optional() combined with trailing whitespace leads to what looks like an incorrect endloc index.
import pyparsing as pp
def first_match(expression, text):
    # Return the first match found by scanString, or None if there is none
    for m, s, e in expression.scanString(text, maxMatches=1):
        return {'match': m, 'start': s, 'end': e}
    return None
expr = pp.Group(
    pp.Word(pp.alphanums) +
    pp.Optional(
        pp.Word(pp.alphanums)
    )
)
In the test below everything is as I expect. The entire string is matched from 0 to 3.
print(first_match(expr, "one"))
# {'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 3}
If I add trailing spaces, even though they are (correctly) not matched, they are included in the endloc index calculation. So the matched range is 0 to 8:
print(first_match(expr, "one     "))
# {'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 8}
Shouldn't the endloc index returned by scanString always be the index just past the last matched character?
This is intended behavior, and is not a regression from pyparsing 2.4.7.
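Think of Optional(expr) as "expr or empty". pyparsing will advance over the intervening whitespace and then, not finding the expression, still consider the parse a successful match (since the expr it was looking for was not required). The reported end location is therefore the position after the skipped whitespace, not the end of the last token.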
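To make the "skip whitespace, then match expr or empty" reasoning concrete, here is a toy model in plain Python. It is not pyparsing's actual code; the helper names (skip_whitespace, match_word, match_optional_word, match_word_then_optional) are hypothetical and only mimic the order of steps described above.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")  # stand-in for pp.Word(pp.alphanums)

def skip_whitespace(text, loc):
    """Return the first index at or after loc that is not whitespace."""
    while loc < len(text) and text[loc].isspace():
        loc += 1
    return loc

def match_word(text, loc):
    """Required word: skip whitespace, then match; return (end, token) or None."""
    loc = skip_whitespace(text, loc)
    m = WORD.match(text, loc)
    return (m.end(), m.group()) if m else None

def match_optional_word(text, loc):
    """Toy 'word or empty': whitespace is skipped before the word is tried,
    and the advanced position is kept even when the word is absent."""
    loc = skip_whitespace(text, loc)
    m = WORD.match(text, loc)
    if m:
        return m.end(), [m.group()]
    return loc, []  # empty match, but loc has already moved past the whitespace

def match_word_then_optional(text):
    """Model of Word + Optional(Word), starting at index 0."""
    first = match_word(text, 0)
    if first is None:
        return None
    end, token = first
    end, rest = match_optional_word(text, end)
    return [token] + rest, end

print(match_word_then_optional("one"))          # (['one'], 3)
print(match_word_then_optional("one     "))     # (['one'], 8)
print(match_word_then_optional("one     two"))  # (['one', 'two'], 11)
```

In this model the empty branch succeeds only after the whitespace has been consumed, which is why the end index lands at 8 rather than 3.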
To get the behavior you are looking for, you could use this format:
expr = pp.Group(
    pp.Word(pp.alphanums) + pp.Word(pp.alphanums)
    | pp.Word(pp.alphanums)
)
If there is a second alphanumeric word present, it will parse them both, and if there is not, then it will just parse the first one.
Test with and without a second value:
print(first_match(expr, "one"))
print(first_match(expr, "one     "))
print(first_match(expr, "one     two"))
print(first_match(expr, "one     two     "))
Gives:
{'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 3}
{'match': ParseResults([ParseResults(['one'], {})], {}), 'start': 0, 'end': 3}
{'match': ParseResults([ParseResults(['one', 'two'], {})], {}), 'start': 0, 'end': 11}
{'match': ParseResults([ParseResults(['one', 'two'], {})], {}), 'start': 0, 'end': 11}
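If you would rather keep the original pp.Optional form, another possibility is to post-process the span that scanString reports and trim any trailing whitespace from it. This is only a sketch in plain Python, not a pyparsing feature; trim_span is a hypothetical helper name, and it assumes that start and end are the values returned by scanString for the given text.

```python
def trim_span(text, start, end):
    """Shrink the half-open span [start, end) so it excludes trailing whitespace."""
    matched = text[start:end]
    return start, start + len(matched.rstrip())

# Span reported for "one" followed by five spaces in the question: (0, 8)
print(trim_span("one     ", 0, 8))      # (0, 3)
# A span that already ends on a token is left unchanged
print(trim_span("one     two", 0, 11))  # (0, 11)
```

The trade-off is that the grammar stays as written in the question, at the cost of an extra step wherever the end index is consumed.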