I am trying to build a parser with RPLY and am failing at making if - else if -else statements work.
It seems to me as if the parser desperately tries to follow one path and when it fails, instead of looking for another, it just stops.
Here is my current productions/rules:
@self.pg.production('file : ')
@self.pg.production('file : expression_seq')
@self.pg.production('block : INDENT expression_seq DEDENT')
@self.pg.production('expression_seq : expression')
@self.pg.production('expression_seq : expression NEWLINE expression_seq')
@self.pg.production('else_clause : else NEWLINE block')
@self.pg.production('else_if_clause : else_if expression NEWLINE block')
@self.pg.production('else_if_clause_seq : else_if_clause')
@self.pg.production('else_if_clause_seq : else_if_clause NEWLINE else_if_clause_seq')
@self.pg.production('expression : if expression NEWLINE block')
@self.pg.production('expression : if expression NEWLINE block NEWLINE else_if_clause_seq')
@self.pg.production('expression : if expression NEWLINE block NEWLINE else_clause')
@self.pg.production('expression : if expression NEWLINE block NEWLINE else_if_clause_seq NEWLINE else_clause')
@self.pg.production('expression : INTEGER')
@self.pg.production('expression : false')
@self.pg.production('expression : true')
And here is the grammar in EBNF:
file = [ expression_seq ] ;
expression_seq = expression , { NEWLINE , expression } ;
block = INDENT , expression_seq , DEDENT ;
expression = if | INTEGER | 'false' | 'true' ;
if = 'if' , expression , NEWLINE , block , { NEWLINE , else_if_clause_seq } , [ NEWLINE , else_clause ] ;
else_clause = 'else' , block ;
else_if_clause = 'else if' , expression , NEWLINE , block ;
else_if_clause_seq = else_if_clause , { NEWLINE , else_if_clause } ;
So as of now, the parser parses:
if true
1
else
1
true
but not:
if true
1
true
=> rply.errors.ParsingError: (None, SourcePosition(idx=13, lineno=4, colno=1))
or
if true
1
else if true
1
else
1
true
=> rply.errors.ParsingError: (None, SourcePosition(idx=29, lineno=5, colno=1))
Is there something wrong with my rules? How would you implement such a (common) grammar?
The problem lies in your handling of NEWLINE
tokens. This creates shift/reduce conflict, which are resolved in favour of the shift action. The consequence is that the conflict reduce action can never be taken, which makes certain grammatical constructions impossible to parse.
Here's one example:
else_if_clause_seq: else_if_clause . [$end, NEWLINE, DEDENT]
| else_if_clause . NEWLINE else_if_clause_seq
That's taken from bison's state machine dump for the same grammar. A parser state is a collection of "items"; each item is a production with a marked position. (The mark is the .
in the two productions.) The mark basically shows how far the parser has gotten when it reaches that state; if the .
is at the end of a production (as in the first line), then a reduction action is possible because the parser has reached the end of the production. If the .
has some following symbol, then the parser could shift the next token if the next token could possibly be (or be the first token in some expansion of) the following symbol. In the case of the second production above, a NEWLINE
could be shifted if it happened to be the next token.
Productions in the state are also annotated with a lookahead set, although bison only shows the lookahead set for productions which could be reduced. The annotation [$end, NEWLINE, DEDENT]
at the end of the first production is that production's lookahead set. In other words, it is the set of possible next tokens in a context in which the production could be reduced.
This state is a shift/reduce conflict because NEWLINE
could either trigger a reduction of else_if_clause_seq: else_if_clause
, or it could be shifted on the assumption that NEWLINE else_if_clause_seq
will be parsed. Since the default resolution of a shift/reduce conflict is to prefer the shift (in bison, ply, rply and most other LR parser generators), the reduction will never take place, forcing the parser to always chose to try to extend else_if_clause_seq
. In effect, that means that an else_if_clause
not at the end of a block must always be followed by another else_if_clause
, making it impossible to parse else_if true 1 else 1
in which the else_if_clause
is followed by an else
clause.
A parser which could look ahead two tokens wouldn't have any problems with this grammar. The second next token, the one which comes after the NEWLINE
, must be either else
or else_if
; in the first case, a reduction is needed, while in the second case the shift is the correct action. In fact, the NEWLINE
really serves no purpose there, since both else
and else_if
must always be preceded by NEWLINE
tokens. Also, since else_if_clause
can only end with block
and block
can only end with DEDENT
, we can conclude that the NEWLINE
must be preceded by a DEDENT
.
It appears that you choose to send NEWLINE
after the DEDENT
, since your grammar seems to indicate that you send a NEWLINE
before an INDENT
. That's probably workable in theory but it definitely leads to the shift/reduce conflicts you're experiencing.
The more common implementation of whitespace-aware lexical scanning uses the algorithm outlined in the Python manual: a NEWLINE
token is generated when the newline is encountered, unless the surrounding lines are explicitly or implicitly joined, and then the decision is made to issue either one INDENT
, one or more DEDENT
s, or nothing. Careful examination of the Python grammar shows how this fits together. Here's a simplified extract, in EBNF:
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: expr_stmt …
compound_stmt: if_stmt …
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
suite
more or less corresponds to your block
but allows unindented single statements on the same line, but note that it starts with a NEWLINE
. Simple (non-compound) statements end with a NEWLINE
; compound statements are treated as being self-delimiting.
An alternative approach is to only issue NEWLINE
tokens in the case where two consecutive lines have exactly the same indentation. As noted above, NEWLINE
tokens in lines which are indented or dedented are strictly redundant, since there presence can be deduced; leaving them out entirely reduces the number of tokens which need to be handled by the parser. But if you do that, you can no longer use the simple principle that simple statements are always terminated with a NEWLINE
since the last simple statement in a block
is directly followed by a DEDENT
. That makes it necessary to use a slightly more complicated (and right-recursive) definition of expression_seq
:
block : INDENT statement_sequence DEDENT
statement : simple_statement | compound_statement
statement_sequence : statement
| simple_statement NEWLINE statement_sequence
| compound_statement statement_sequence