python parsing programming-languages ply

How would you parse a standard if - else if - else statement? (with RPLY)

I am trying to build a parser with RPLY and am failing at making if - else if -else statements work.

It seems to me as if the parser desperately tries to follow one path and when it fails, instead of looking for another, it just stops.

Here is my current productions/rules:

@self.pg.production('file : ')
@self.pg.production('file : expression_seq')

@self.pg.production('block : INDENT expression_seq DEDENT')

@self.pg.production('expression_seq : expression')
@self.pg.production('expression_seq : expression NEWLINE expression_seq')

@self.pg.production('else_clause : else NEWLINE block')

@self.pg.production('else_if_clause : else_if expression NEWLINE block')

@self.pg.production('else_if_clause_seq : else_if_clause')
@self.pg.production('else_if_clause_seq : else_if_clause NEWLINE else_if_clause_seq')

@self.pg.production('expression : if expression NEWLINE block')
@self.pg.production('expression : if expression NEWLINE block NEWLINE else_if_clause_seq')
@self.pg.production('expression : if expression NEWLINE block NEWLINE else_clause')
@self.pg.production('expression : if expression NEWLINE block NEWLINE else_if_clause_seq NEWLINE else_clause')

@self.pg.production('expression : INTEGER')

@self.pg.production('expression : false')
@self.pg.production('expression : true')

And here is the grammar in EBNF:

file = [ expression_seq ] ;
expression_seq = expression , { NEWLINE , expression } ;
block = INDENT , expression_seq , DEDENT ;
expression = if | INTEGER | 'false' | 'true' ;
if = 'if' , expression , NEWLINE , block , { NEWLINE , else_if_clause_seq } , [ NEWLINE , else_clause ] ;
else_clause = 'else' , block ;
else_if_clause = 'else if' , expression , NEWLINE , block ;
else_if_clause_seq = else_if_clause , { NEWLINE , else_if_clause } ;

So as of now, the parser parses:

if true
  1
else
  1

true

but not:

if true
  1

true
=> rply.errors.ParsingError: (None, SourcePosition(idx=13, lineno=4, colno=1))

if true
  1
else if true
  1
else
  1

true
=> rply.errors.ParsingError: (None, SourcePosition(idx=29, lineno=5, colno=1))

Is there something wrong with my rules? How would you implement such a (common) grammar?

Solution

The problem lies in your handling of NEWLINE tokens. This creates shift/reduce conflict, which are resolved in favour of the shift action. The consequence is that the conflict reduce action can never be taken, which makes certain grammatical constructions impossible to parse.

Here's one example:

else_if_clause_seq: else_if_clause .  [$end, NEWLINE, DEDENT]
                  | else_if_clause . NEWLINE else_if_clause_seq

That's taken from bison's state machine dump for the same grammar. A parser state is a collection of "items"; each item is a production with a marked position. (The mark is the . in the two productions.) The mark basically shows how far the parser has gotten when it reaches that state; if the . is at the end of a production (as in the first line), then a reduction action is possible because the parser has reached the end of the production. If the . has some following symbol, then the parser could shift the next token if the next token could possibly be (or be the first token in some expansion of) the following symbol. In the case of the second production above, a NEWLINE could be shifted if it happened to be the next token.

Productions in the state are also annotated with a lookahead set, although bison only shows the lookahead set for productions which could be reduced. The annotation [$end, NEWLINE, DEDENT] at the end of the first production is that production's lookahead set. In other words, it is the set of possible next tokens in a context in which the production could be reduced.

This state is a shift/reduce conflict because NEWLINE could either trigger a reduction of else_if_clause_seq: else_if_clause, or it could be shifted on the assumption that NEWLINE else_if_clause_seq will be parsed. Since the default resolution of a shift/reduce conflict is to prefer the shift (in bison, ply, rply and most other LR parser generators), the reduction will never take place, forcing the parser to always chose to try to extend else_if_clause_seq. In effect, that means that an else_if_clause not at the end of a block must always be followed by another else_if_clause, making it impossible to parse else_if true 1 else 1 in which the else_if_clause is followed by an else clause.

A parser which could look ahead two tokens wouldn't have any problems with this grammar. The second next token, the one which comes after the NEWLINE, must be either else or else_if; in the first case, a reduction is needed, while in the second case the shift is the correct action. In fact, the NEWLINE really serves no purpose there, since both else and else_if must always be preceded by NEWLINE tokens. Also, since else_if_clause can only end with block and block can only end with DEDENT, we can conclude that the NEWLINE must be preceded by a DEDENT.

It appears that you choose to send NEWLINE after the DEDENT, since your grammar seems to indicate that you send a NEWLINE before an INDENT. That's probably workable in theory but it definitely leads to the shift/reduce conflicts you're experiencing.

The more common implementation of whitespace-aware lexical scanning uses the algorithm outlined in the Python manual: a NEWLINE token is generated when the newline is encountered, unless the surrounding lines are explicitly or implicitly joined, and then the decision is made to issue either one INDENT, one or more DEDENTs, or nothing. Careful examination of the Python grammar shows how this fits together. Here's a simplified extract, in EBNF:

stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: expr_stmt …
compound_stmt: if_stmt …
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

suite more or less corresponds to your block but allows unindented single statements on the same line, but note that it starts with a NEWLINE. Simple (non-compound) statements end with a NEWLINE; compound statements are treated as being self-delimiting.

An alternative approach is to only issue NEWLINE tokens in the case where two consecutive lines have exactly the same indentation. As noted above, NEWLINE tokens in lines which are indented or dedented are strictly redundant, since there presence can be deduced; leaving them out entirely reduces the number of tokens which need to be handled by the parser. But if you do that, you can no longer use the simple principle that simple statements are always terminated with a NEWLINE since the last simple statement in a block is directly followed by a DEDENT. That makes it necessary to use a slightly more complicated (and right-recursive) definition of expression_seq:

block              : INDENT statement_sequence DEDENT
statement          : simple_statement | compound_statement
statement_sequence : statement
                   | simple_statement NEWLINE statement_sequence
                   | compound_statement statement_sequence