Search code examples
pythonparsinggrammarpyparsingbnf

How to parse grammar `(a | b)* a`


Let's consider the grammar described by the following Backus-Naur form:

a ::= 'a'
b ::= 'b'
grammar ::= (a | b)* a

I'm trying to parse it using pyparsing and came to the following implementation

a = Literal('a')
b = Literal('b')
grammar = (a | b)[...] + 'a'

However, it fails to parse any of the string described by the grammar, with for example grammar.parseString('aba') raising

ParseException: Expected "a", found end of text  (at char 3), (line:1, col:4)

It seems to be caused by the fact that the [...] expression is parsed by consuming tokens until it is not possible to do it anymore. Then, there is no token left to be parsed by the last literal.

A way to do this would be to use the FollowedBy class:

grammar = ((a | b) + FollowedBy(a | b))[...] + a

which works. However it is highly inelegant, doesn't seem to be efficient, and is not really versatile.

Is there a better way to parse this grammar with pyparsing ?


Solution

  • No you are exactly right, pyparsing does not do backtracking like you might from a regex like "[ab]*a". Pyparsing does no lookahead of any kind unless you explicitly implement it.

    Here is an expanded version of your original parser, with added setName and setDebug calls:

    a = Literal('a').setName("A").setDebug()
    b = Literal('b').setName("B").setDebug()
    grammar = (a | b)[...] + a().setName("trailing_a").setDebug()
    
    grammar.runTests("""\
        aba
        """)
    

    When parsing "aba", here is the debug output:

    Match A at loc 0(1,1)
    Matched A -> ['a']
    Match A at loc 1(1,2)
    Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    Match B at loc 1(1,2)
    Matched B -> ['b']
    Match A at loc 2(1,3)
    Matched A -> ['a']
    Match A at loc 3(1,4)
    Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
    Match B at loc 3(1,4)
    Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
    Match trailing_a at loc 3(1,4)
    Exception raised:Expected trailing_a, found end of text  (at char 3), (line:1, col:4)
    
    aba
       ^
    FAIL: Expected trailing_a, found end of text  (at char 3), (line:1, col:4
    

    and you can see the trailing_a getting matched as part of the initial repetition, not as a trailing_a. Since there is no actual trailing 'a' now, the parse fails.

    You can either define a special form of a for the leading repetition (as you have done in one line) like this in two lines:

    leading_a = a + FollowedBy(a | b)
    grammar = (leading_a | b)[...] + 'a'
    

    With debug output, we can follow the parser logic:

    Match leading_a at loc 0(1,1)
    Match A at loc 0(1,1)
    Matched A -> ['a']
    Match A at loc 1(1,2)
    Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    Match B at loc 1(1,2)
    Matched B -> ['b']
    Matched leading_a -> ['a']
    Match leading_a at loc 1(1,2)
    Match A at loc 1(1,2)
    Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    Match B at loc 1(1,2)
    Matched B -> ['b']
    Match leading_a at loc 2(1,3)
    Match A at loc 2(1,3)
    Matched A -> ['a']
    Match A at loc 3(1,4)
    Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
    Match B at loc 3(1,4)
    Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
    Exception raised:Expected {A | B}, found end of text  (at char 3), (line:1, col:4)
    Match B at loc 2(1,3)
    Exception raised:Expected B, found 'a'  (at char 2), (line:1, col:3)
    
    aba
    ['a', 'b', 'a']
    

    Or define a special trailing_a, and use the stopOn argument to ZeroOrMore:

    trailing_a = a + ~FollowedBy(a | b)
    grammar = OneOrMore(a | b, stopOn=trailing_a) + 'a'
    

    to get similar results.

    EDIT Changing grammar to just (a | b)[...] shows this debugging output:

    Match A at loc 0(1,1)
    Matched A -> ['a']
    Match A at loc 1(1,2)
    Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    Match B at loc 1(1,2)
    Matched B -> ['b']
    Match A at loc 2(1,3)
    Matched A -> ['a']
    Match A at loc 3(1,4)
    Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
    Match B at loc 3(1,4)
    Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
    
    aba
    ['a', 'b', 'a']
    

    So, yes, lookahead does incur a performance penalty.

    pyparsing includes an internal cacheing capability, also known as "packrat parsing". Here is the debug output, with cached values marked with '*':

      Match trailing_a at loc 0(1,1)
      Match A at loc 0(1,1)
      Matched A -> ['a']
      Match A at loc 1(1,2)
      Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
      Match B at loc 1(1,2)
      Matched B -> ['b']
      Exception raised:Found unwanted token, FollowedBy:({A | B}), found 'b'  (at char 1), (line:1, col:2)
    * Match A at loc 0(1,1)
    * Matched A -> ['a']
      Match trailing_a at loc 1(1,2)
    * Match A at loc 1(1,2)
    * Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    * Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    * Match A at loc 1(1,2)
    * Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
    * Match B at loc 1(1,2)
    * Matched B -> ['b']
      Match trailing_a at loc 2(1,3)
      Match A at loc 2(1,3)
    * Matched A -> ['a']
      Match A at loc 3(1,4)
      Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
      Match B at loc 3(1,4)
      Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
      Matched trailing_a -> ['a']
    * Match trailing_a at loc 2(1,3)
    * Match A at loc 2(1,3)
    * Matched A -> ['a']
    * Matched trailing_a -> ['a']
    

    Summary of "Match" operations:

    • Without lookahead: 6
    • With lookahead: 12
    • With lookahead+packrat: 9

    On a final note, it is possible to inject additional logic at parse time using parse actions. Parse actions can be written as a method (which can return a modified set of tokens, or raise an exception) or as a predicate function (which returns True or False, and pyparsing will raise an exception in the event of a False return).

    So you could write your grammar using the fastest no-lookahead form, and then add a validating condition to be run afterward:

    grammar = (a | b)[...]
    grammar.addCondition(lambda t: t[-1] == 'a', message="string does not end with 'a'")
    

    It is very likely that the savings in parse time would be enough to offset the added cost of doing the separate condition evaluation.