Search code examples
parsingocamllalrmenhir

In Menhir is it possible to make a rule left-associative if it does not have an operator?


I am trying to use Menhir to write a parser for language of regular expressions. My desired grammar, before I modify it to remove ambiguities, looks a bit like the following example. Note that the "sequencing / concatenation" is implicit and there is no token associated with that operation.

%token LPAREN RPAREN
%token CHAR STAR PIPE
%token EOF

%start <unit> parse

%%

parse: re EOF {()}

re:
  | LPAREN re RPAREN {()}  (* Grouping *)
  | CHAR             {()}  (* Single character *)
  | re STAR          {()}  (* Kleene star *)
  | re re            {()}  (* Sequencing / Concatenation *)
  | re PIPE re       {()}  (* Alternation *)

If I had a token for the concatenation, I would be able to remove the ambiguities just by using precedence declarations

%token LPAREN RPAREN
%token CHAR STAR PIPE
%token CONCAT
%token EOF

%left PIPE
%left CONCAT
%nonassoc STAR

%start <unit> parse

%%

parse: re EOF {()}

re:
  | LPAREN re RPAREN   {()}  (* Grouping *)
  | CHAR               {()}  (* Single character *)
  | re STAR            {()}  (* Kleene star *)
  | re CONCAT re       {()}  (* Sequencing / Concatenation *)
  | re PIPE re         {()}  (* Alternation *)

However, I can't get things to work without the CONCAT token in the concatenation rule. I tried using a %prec declaration but there were still some shift/reduce conflicts left.

%token LPAREN RPAREN
%token CHAR STAR PIPE
%token CONCAT
%token EOF

%left PIPE
%left CONCAT
%nonassoc STAR

%start <unit> parse

%%

parse: re EOF {()}

re:
  | LPAREN re RPAREN   {()}  (* Grouping *)
  | CHAR               {()}  (* Single character *)
  | re STAR            {()}  (* Kleene star *)
  | re re %prec CONCAT {()}  (* Sequencing / Concatenation *)
  | re PIPE re         {()}  (* Alternation *)

I think this might be because menhir can't tell that the sequencing is supposed to be left-associative but I'm not 100% sure if that is the cause of the problem.

So far, the only solution I could find involved breaking up the re rule into a bunch of different rules that make the precedence levels and associativities explicit:

%token LPAREN RPAREN
%token CHAR STAR PIPE
%token EOF

%start <unit> parse

%%

parse: re EOF {()}

re: re3 {()}

re0:
  | LPAREN re RPAREN {()}  (* Grouping *)
  | CHAR             {()}  (* Single character *)

re1:
  | re0              {()}
  | re0 STAR         {()}  (* Kleene star *)

re2:
  | re1              {()}
  | re2 re1          {()}  (* Sequencing / Concatenation *)

re3:
  | re2              {()}
  | re3 PIPE re2     {()}  (* Alternation *)

While this last example works fine, I'm really curious if it would be possible to remove all the umbiguities and conflicts just by using precedence and associativity declarations, without needing to rewrite the grammar.


Solution

  • Well, first, this is not exactly a problem on Menhir, but in the kind of grammar that Menhir accepts, which stands in the LR(1) set. If the grammar you provide does not need precedence annotations, the grammar is taken as a SLR(1), a subset of LR(1).

    Your last example works because you have productions for each precedence level (like parsing expression grammars, that are unambiguous by nature), and definitely is not a bad way to write; several modern compilers use this notation to handle more complex languages.

    To understand your problem, let's first ask for Menhir to explain us where the conflicts live:

    menhir --explain parser.mly

    It will generate a parser.conflicts file with an explanation of in which states it can take both actions, reduce and shift:

    ...
    ** In state 8, looking ahead at STAR, shifting is permitted
    ** because of the following sub-derivation:
    
    re re
       re . STAR
    
    ** In state 8, looking ahead at STAR, reducing production
    ** re -> re re
    ** is permitted because of the following sub-derivation:
    
    re STAR // lookahead token appears
    re re .
    
    ** Conflict (shift/reduce) in state 7.
    ** Tokens involved: STAR PIPE LPAREN CHAR
    ** The following explanations concentrate on token STAR.
    ** This state is reached from parse after reading:
    
    re PIPE re
    
    ** The derivations that appear below have the following common factor:
    ** (The question mark symbol (?) represents the spot where the derivations begin to differ.)
    
    parse
    re EOF
    (?)
    

    The grammar is really ambiguous for LR(1), so:

    CHAR CHAR STAR
    

    can be computed both as:

    • (CHAR CHAR) STAR
    • CHAR (CHAR STAR)

    Another way to rewrite your grammar without conflicts is making concatenation possible via list:

    re:
    | term PIPE re
    | term { } (* Left associativity *)
    
    term:
    | list(base STAR* { }) { } (* Concatenation is taken by list *)
    
    base:
    | CHAR
    | LPAREN re RPAREN { }