Search code examples
pythonlr-grammar

Python lrparsing module; cannot parse simple recursive grammar


expr = Ref('expr')
block = '{' + Repeat(expr) + '}'
expr = block | Token(re='[0-9]')
START = expr

Here's a grammar using the lrparsing module for Python. The module reports no conflicts in the grammar.

It fails to parse the string {{0}} with the error lrparsing.ParseError: line 1 column 5: Got '}' when expecting __end_of_input__ while trying to match block in state 11

The stack state step by step is:

shift  '{'; ItemSet:5='{'
shift  '{'; ItemSet:5='{' ItemSet:5='{'
shift  /[0-9]/; ItemSet:4=/[0-9]/ ItemSet:5='{' ItemSet:5='{'
reduce '}'; ItemSet:4=/[0-9]/ -- ItemSet:7=expr ItemSet:5='{' ItemSet:5='{'
reduce '}'; ItemSet:7=expr -- ItemSet:9=expr ItemSet:5='{' ItemSet:5='{'
shift  '}'; ItemSet:11='}' ItemSet:9=expr ItemSet:5='{' ItemSet:5='{'

Which afaik means it's shifting {{0, then upon seeing } reducing 0 to expr, then reducing on } again without having shifted it first, which confuses the bajeezus out of me.

Is this a bug, or am I doing something infinitely simple and stupid?

If it's my grammar, how would I refactor it to satisfy my eternally steamy and passionate desires? If it's a bug, could someone direct me to a python module that has syntax most similar to lrparsing that does work?

EDIT: Refactoring as:

blocks = Ref('blocks')
block = Ref('block')
expr = Ref('expr')
blocks = blocks + block | THIS*0 # THIS*0 is the idiomatic way of getting the empty string
block = '{' + expr + '}'
expr = blocks | Token(re='[0-9]')
START = expr

Allows for proper parsing. The question for me now, is... why? I feel like lrparsing would've complained to me earlier about any parsing problems... Does Repeat simply not work the way I expect it to?


Solution

  • lrparsing has a bug; it does not consider the recursive repeats correctly.

    Your actual problem can be solved by simple recursion, as you did in your extended edit, though with a bit less clutter.

    block = Ref('block')
    block = '{' + block + '}' | Token(re='[0-9]')
    START = block
    

    Note also, that your original grammar would have allowed input such as {{0{1}}}. (The reason being that the repeatable part can be expanded to either simple numbers or expr again.) Considering your second grammar, you probably didn't want that anyways.

    I did get some work done with pyparsing, but the syntax is considerably different. Analogous example:

    from pyparsing import Forward, Literal, nums, oneOf, Word
    
    l = lambda c: Literal(c).suppress()
    block = Forward()
    block << (Word(nums, exact=1) ^ l('{') + block + l('}'))
    print(block.parseString("{{0}}"))
    

    Output:

    ['0']
    

    Hope that helps.