Search code examples
haskellgrammarparser-generatorhappyambiguous-grammar

Shift/reduce conflict in Happy grammar


I've got the following (heavily stripped down) Happy grammar

%token
   '{'   { Langle }
   '}'   { Rangle }
   '..'  { DotDot }
   '::'  { ColonColon }
   '@'   { At }
   mut   { Mut }
   ident { Ident }

 %%

 pattern
   : binding_mode ident at_pat  { error "identifier pattern" }
   | expr_path                  { error "constant expression" }
   | expr_path '{' '..' '}'     { error "struct pattern" }

 binding_mode
   : mut                        { }
   |                            { }

 at_pat
   : '@' pat                    { }
   |                            { }

 expr_path
   : expr_path '::' ident       { }
   | ident                      { }

Which has shift/reduce conflicts around identifiers in patterns. By default, Happy chooses to shift, but in this case that isn't what I want: it tries to shoe-horn everything into constant expression even when it could be an identifier pattern.

I've read that precedence/associativity is the way to solve this sort of problem, but nothing I've added has been able to budge the grammar in the right direction (to be fair, I've been taking mostly shots in the dark).

Using some obvious tokenization, I would like to have:

  • x to yield identifier pattern
  • mut x to yield identifier pattern
  • std::pi to yield constant expression
  • point{..} to yield struct pattern
  • std::point{..} to yield struct pattern

Basically, unless there is a { or :: token waiting to be consumed, an identifier should go to the identifier pattern case.


I apologize if my question is unclear - part of the problem is that I'm having a tough time pinpointing what the problem even is. :(


Solution

  • First, it's important to understand what a shift is. A shift is the result of accepting the next input token, and putting it on the parser stack (where it will eventually be part of a production, but it is not necessary to know which one yet.) A reduction is taking zero or more tokens off the top of the stack which match the right-hand side of some production, and replacing them with the left-hand side.

    When the parser decides to create an identifier pattern out of binding_mode ident at_pat where at_pat is empty, it is not shifting; it is reducing. In fact, it reduces twice: first it reduces zero stacked symbols into an empty at_pat and then it reduces the top three stack symbols into an identifier pattern. Had there been no binding_mode, it could have reduced ident to expr_path and then reduced the expr_path into a constant_expression. So that would be a reduce/reduce conflict.

    But there is another issue, precisely because binding_mode is nullable. When the parser sees an ident, it does not know whether or not a binding_mode would be possible, so it doesn't know whether to reduce an empty binding_mode or to shift the ident. That is a shift/reduce conflict. Since it prefers shift to reduce, it chooses to shift the ident, which means that an empty binding_mode cannot be produced , which in turn precludes the reduce/reduce conflict (and prevents ident @ pat from being recognized at all.)

    So to untangle all of that, we need to start by avoiding the necessity to reduce an empty binding_mode. We do that by the usual nullable-production elimination algorithm, which involves making two copies of the right-hand side, one with the nullable nonterminal, and the other without; we then remove the nullable production. Once we do that, the reduce/reduce conflict appears.

    To avoid the reduce/reduce conflict, we need to be explicit about which production is preferred. Reduce/reduce conflicts cannot be resolved through precedence declarations because the precedence algorithm always involves the comparison between a production (which could be reduced) and a terminal (which could be shifted). So the resolution must be explicit, and that means that we need to say that a bare ident is a pattern, while an expr_path which is not an ident is a constant expression. That leaves us with the following:

    (Note that I've used non-terminals to label the three different productions for pattern, rather than relying on actions. For me that makes it easier to think about, and read.)

    pattern: identifier_pattern | constant_expression | struct_pattern
    

    Here is the null production elimination:

    identifier_pattern:   ident at_pat 
                      |   binding_mode ident at_pat
    

    Here is the explicit prohibition on idents:

    constant_expression:  complex_expr_path 
    
    struct_pattern:       expr_path '{' '..' '}'
    

    binding_mode is no longer nullable:

    binding_mode: mut
    
    at_pat
       : '@' pat
       | %empty
    

    Here we create the two different expr_paths:

    complex_expr_path
       : complex_expr_path '::' ident
       | ident '::' ident
    
    expr_path: ident | complex_expr_path
    

    I hope that solution has some relationship to your original grammar.