Search code examples
bisonyacc

maximum munch rule with yacc/bison -- seems to be taking minimum munch


I'm trying to introduce a new language construct into a large/complex grammar. I know it'll make the syntax ambiguous, but I'm hoping to resolve that with the 'maximum munch' rule. That is, put my construct first, so it's taken by preference. I've got some success, but other parts of the tree not so much. Consider:

( 1, 2, 3 )          // triple
( 4, 5 )             // twople
( 6 )                // *not* a oneple, see below
()                   // zerople

(7 + 8) * 9          // single parens used to override usual precedence
( (( 7 + 8 )) * 9 )  // currently this means the same, but nobody would write that!

(( 6 ))              // I want to be a oneple
( ( 6 ) )            // this also a oneple
( ( ( 6 ) ) )        // this also a oneple with a superfluous pair of parens, ambiguous
((  ( 6 )  ))        // superflous parens are innermost
(  (( 6 ))  )        // superfluous parens are outermost

((7 + 8) * (9 + 10)) // not a oneple, despite the (( ... ))

Those three examples with 6 inside three pairs of parens are ambiguous, I'm not bothered which of them the grammar takes, they're semantically the same. By the 'maximum munch' rule, it should take the middle of the three i.e. leaving the innermost parens as superfluous.

The lexer takes each (, ) as a separate token. Currently the parser is taking ( ( ( 6 ) ) ) as equivalent to 6 (where that's parsed as expr/int) -- i.e. what the grammar did before I tried changing it.

The grammar has many levels of tokens defined by other tokens. Some expressions in some contexts are recognising the double-parens OK. Others not so much (and it's hard to boil this down to a reasonable-sized example). Are there any general techniques for persuading bison to take maximum munch?

Addit: This ambiguity is similar to the celebrated ambiguity in the very first language to use BNF: ALGOL 60

if cond1    then 
  if cond2  then blah2
 else            blah3;      // is this on cond2 False or cond1 False?

That was resolved by saying the else attaches to the innermost if/then that hasn't already got an else -- that is, on cond2 False in this case; leaving cond1 without an else branch. (ALGOL didn't have an 'offside rule', thank heavens!)

Addit2: OK, by popular demand, the yacc code before I started amending is here (Now you're going to wish you never asked.) This for example is working in rule aexp (middle line is my mod)

  | '(' exp ')'         {$$ = gc3($2);}
  | '(' '(' exp ')' ')'     {$$ = gc5(buildTuple(cons($3,NIL)));}   /* ADC Apr-2020  */
  | '(' exps2 ')'       {$$ = gc3(buildTuple($2));}

This line is not working -- in rule apat the pat keeps being parsed as a multi-parenthesised ordinary pattern.

apat      : NUMLIT          {$$ = $1;}
  | var             {$$ = $1;}
  | '(' '(' pat ')' ')'     {$$ = gc5(buildTuple(singleton($3)));}  /* ADC Apr-2020 */
  | apat_vI         {$$ = $1;}
  ;

I have tried inserting that line in all sorts of places up and down the tree from pat; I've even tried putting it in multiple productions at the same time, which seems well dodgy. The parser consistently seems to ignore it.

Thank you for @Chris Dodd's answer. I thought that despite this giving shift/reduce errors, I had read on other SO answers that bison 'does the right thing' -- that is, if you put one production above another, it'll prefer that first.


Solution

  • I've fixed this; it's nothing to do with %left / %right precedence.

    Note in the Addit2: code that worked, there's the same non-terminal pat appearing in the two rules that are potentially ambiguous; and the terminals that differentiate them also appear (single pair of parens vs doubled-up pair). Whereas in the problematic example, the 'competing' non-terminals are differently named; and there's no terminals.

    In those rules for apat, NUMLIT cannot start with (, neither can var. apat_vI might start with (, so move this rule for pat into it. (Look in the repo I linked to if you're following along at home.)

    But it's not good enough to just move the rule into apat_vI. Here's the critical few lines:

      | '(' pat_npk ')'     {$$ = gc3($2);}   
      | '(' npk ')'         {$$ = gc3($2);}                  
    /*    | '(' '(' pat ')' ')'     {$$ = gc5(buildTuple(singleton($3)));}      /*   ignored */
      | '(' '(' pat_npk ')' ')'     {$$ = gc5(buildTuple(singleton($3)));}  /* same non-term */
    

    The commented-out line with pat is the one I moved, but it doesn't resolve the ambiguity. (And doesn't decrease the count of shift/reduce conflicts.)

    Looking at the productions from pat, they're npk or pat_npk. In this case, npk can't begin with ( -- so ignore it; pat_npk might. So change the pat to pat_npk, and that decreases the shift/reduce conflicts by 2.

    ( ( ( 6 ) ) ) is now parsed same as (( 6 )), differently to bare 6. Here's a session using the language:

    Main> let foo x = (( x )) in foo 6           -- (( x )) in exp always worked
    (( 6 ))
    Main> let bar (( x )) = x in bar (( 6 ))     -- (( x )) in pat got treated as bare x
    6
    Main> let bar (( x )) = x in bar (( (6) ))   -- three pairs of parens treated as a double
    6
    Main> let bar (( x )) = x in bar (( ((6)) )) -- four pairs of parens treated as two doubles
    (( 6 ))
    

    The critical characteristic is:

    • bison must be able to see there's two productions within the rules for apat_vI that include the same non-terminal pat_npk; and
    • that their context differs by terminals (a pair of parens vs doubled-up).
    • Then upon meeting a ( in the incoming token string, the parser will shift and wait to see if it gets another.

    Just as in the if/else conflict: encourage the parser to shift the first else, and wait to see if it gets another.