Search code examples
parsingcompiler-constructiongrammarlalrambiguous-grammar

Resolving LALR Ambiguities


I've recently wrapped my head around LALR enough to write an LALR generator, and I'm trying to construct a java- or c#-style grammar for it (the beginnings of which are specified here).

I know it's extra effort to write the parser generator, like re-inventing the wheel (why not use Antlr?), but my goal is to bootstrap a hobby OS that can compile itself without depending on a third-party toolchain. My problem though is not with the generator, but with the grammar.

I'm running into reduce/reduce ambiguities with statements and expressions.

I know how to resolve certain types of ambiguities, such as dangling-else, but these few have are not intuitive to me, and they have me stumped.

What's the best way to resolve these? Also, is there a prototyping tool that I can use to help visualize the solution? Or, should I go back to square one and try implementing a GLR parser generator for the grammar?

These statements are legal:

Generic.List<int> myVar1 = x + 4, myVar2; // stmt -> var-decl ;
                                          // var-decl -> type-name var-decl-list

t = 99;                           // simple-stmt -> assign

i++;                              // simple-stmt -> incr-decr
                                  // incr-decr -> primary-expr ++

json.deserialize<list<int>>(obj); // simple-stmt -> call
                                  // call -> primary-expr ( params )
                                  // ...  -> primary-expr . basic-name ( params )
                                  // ...  -> basic-name . basic-name ( params )

Here's how it's set up:

basic-name : ident < type-list >
           | ident

nested-name : nested-name . basic-name
            | basic-name

basic-type : int | bool | ...

type-name : nested-name
          | basic-type

stmt : var-decl ;
     | simple-stmt ;
     | ...

var-decl : type-name var-decl-list

var-decl-list : var-decl-list , var-init
              | var-init

var-init : ident assign-op expression
         | ident

simple-stmt : assign
            | call
            | incr-decr

expr : assign-expr

assign-expr : assign
            | cond-expr

assign : unary-expr assign-op expr

...
rel-expr : rel-expr < shift-expr
         ...
         | shift-expr

...
unary-expr : unary-op primary-expr
           | primary-expr

unary-op : + - ! ~ ++ --  // Prefix operators
         | ( type-name )  // Conversion operator

primary-expr : call
             | primary-expr . basic-name
             | primary-expr ++
             | ( expr )
             ...
             | basic-name

call : primary-expr ( params )

incr-decr : primary-expr ++
          | -- primary-expr
          | ...

So when the parser is expecting a statement, the *LR(k) item set kernel is method-body -> { * stmts-opt } and the full item set for the state looks something like this:

method-body -> { * stmts-opt }
stmts-opt -> * stmts
stmts-opt -> *
stmts -> * stmts stmt
stmt -> * var-decl ;
stmt -> * simple-stmt ;
var-decl -> * type-name var-decl-list
simple-stmt -> * assign
simple-stmt -> * call
simple-stmt -> * incr-decr
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
incr-decr -> * primary-expr ++
...

When an identifier is shifted, the next state is:

basic-name -> ident *
basic-name -> ident * < type-list >

which is parsed out or reduced, and brings the next state to:

nested-name -> basic-name *
primary-expr -> basic-name *

Potential conflict. In the parent state, lookaheads don't help, since there is a dot in nested-name and primary-expr. Oh, goody, let's try reducing by nested-name:

type-name -> nested-name *
nested-name -> nested-name * . basic-name

Nothing to see here... Now, how about reducing by primary-expr instead:

unary-expr -> primary-expr *
primary-expr -> primary-expr * . basic-name
primary-expr -> primary-expr * ++
call -> primary-expr * ( params )
incr-decr -> primary-expr * ++
...

Now when we shift ++, we get:

primary-expr -> primary-expr ++ *
incr-decr -> primary-expr ++ *

...another reduce-reduce conflict.

Let's try shifting the ( instead of the ident:

primary-expr -> ( * expr )
unary-op -> ( * type-name )
expr -> * assign-expr
assign-expr -> * assign
assign-expr -> * cond-expr
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
cond-expr -> * ...
...
rel-expr -> * rel-expr < shift-expr
rel-expr -> * shift-expr
...
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident

Same issues when you shift an ident or ( onto the stack.

These are just the ones I've run into so far. Since basic-name has precedence over rel-expr, I'm assuming something like x < n would be interpreted as basic-name -> ident < type-list *, and error out if it was actually a relational expression.

My brain has reached the point where I could really use some help.


Solution

  • There are a few questions in your post, which makes it not really ideal for SO. But I'll try to provide some thoughts about each one. As I see it, you have three issues:

    1. Distinguishing expression statements from expressions which are not statements.

    2. Parsing hierarchically-named types in a declaration without conflicting with field-access expressions in expression statements

    3. Distinguishing between the use of < as a comparison operator and as a template bracket.


    1. Distinguishing expression statements from expressions which are not statements.

    As I understand, the desire is to only allow as statements expressions which have (or potentially have) some kind of side-effect: assignments, increment mutations, and subroutine calls. Roughly speaking, this corresponds to Java, whose grammar includes:

    StatementExpression:
      Assignment
      PreIncrementExpression
      PreDecrementExpression
      PostIncrementExpression
      PostDecrementExpression
      MethodInvocation
      ClassInstanceCreationExpression
    

    Each of the alternatives listed for StatementExpression also appears somewhere in the derivation tree for Expression, where they have been factored out of the lists of possibilities. Here's a very condensed sample:

    Expression:
      LambdaExpression
      AssignmentExpression
    
    AssignmentExpression:
      ConditionalExpression
      Assignment
    
    Assignment:
      LeftHandSide AssignmentOperator Expression
    
    ...
    
    UnaryExpression:
      PreIncrementExpression
      + UnaryExpression
      UnaryExpressionNotPlusMinus
    
    PreIncrementExpression:
      ++ UnaryExpression
    
    UnaryExpressionNotPlusMinus:
      PostfixExpression
      ~ UnaryExpression
    
    PostfixExpression:
      Primary
      ExpressionName
      PostIncrementExpression
    
    PostIncrementExpress:
      PostfixExpression ++
    

    Note how the non-terminals used on the right-hand side of ExpressionStatement are special-cased at each precedence level. In the C++ grammar, which doesn't limit which expressions can be statements, there is no need for a separate Assignment non-terminal:

    assignment-expression:
      conditional-expression
      logical-or-expression assignment-operator initializer-clause
    

    But in Java, that won't work. It needs to create the additional non-terminal which does not derive ConditionalExpression, precisely in order to allow Assignment to be a Statement and AssignmentExpression to be an Expression.

    2. Parsing hierarchically-named types in a declaration without conflicting with field-access expressions in expression statements

    Similar to the above, here it is necessary to put hierarchical names (which must start with a basic-name) from other forms of field access expressions, which might start with any (other) primary-expr. The former can be type names or primary expressions; the latter can only be type names. To make this distinction, we need to split the primary-expr production:

    primary-expr : field-access-expr
                 | nested-name
    
    non-field-access-expr:
                   call
                 | post-increment-expression  // was primary-expr ++
                 | ( expr )
                 ...
    
    field-access-expr :
                   non-field-access-expr
                 | field-access-expr . basic-name
    

    3. Distinguishing between the use of < as a comparison operator and as a template bracket.

    Unlike the other two issues, this one might actually be an ambiguity in the language. In C++, for example, template brackets are definitely ambiguous; they can only be resolved by knowing (or being told) whether a particular name is a template name or not. In Java, on the other hand, the ambiguity is solved by sometimes requiring the type parameters to come before the generic name. For example:

    ConstructorDeclarator:
      [TypeParameters] SimpleTypeName ( [FormalParameterList] )
    

    or

    MethodInvocation:
      Primary . [TypeArguments] Identifier ( [ArgumentList] )
    

    In C#, there is yet a different rule, which requires looking at the character following the > which potentially matches the opening <. The algorithm is described in §7.6.4.2 of the C# manual; I have no idea how you would implement that in an LALR(1) parser.