I am working on a toy language for fun (called NOP) using Bison and Flex, and I have hit a wall. I am trying to parse sequences that look like name1.name2 and name1.func1().name2, and I get a lot of reduce/reduce conflicts. I know why I am getting them, but I am having a heck of a time figuring out what to do about it.
So my question is whether this is a legitimate irregularity that can't be "fixed", or if my grammar is just wrong. The productions in question are compound_name and compound_symbol. It seems to me that they should parse separately. If I try to combine them I get conflicts with that as well. In the grammar, I am trying to illustrate what I want to do, rather than anything "clever".
%debug
%defines
%locations
%{
%}
%define parse.error verbose
%locations
%token FPCONST INTCONST UINTCONST STRCONST BOOLCONST
%token SYMBOL
%token AND OR NOT EQ NEQ LTE GTE LT GT
%token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%token DICT LIST BOOL STRING FLOAT INT UINT NOTHING
%right ADD_ASSIGN SUB_ASSIGN
%right MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%left AND OR
%left EQ NEQ
%left LT GT LTE GTE
%right ':'
%left '+' '-'
%left '*' '/' '%'
%left NEG
%right NOT
%%
program
: {} all_module {}
;
all_module
: module_list
;
module_list
: module_element {}
| module_list module_element {}
;
module_element
: compound_symbol {}
| expression {}
;
compound_name
: SYMBOL {}
| compound_name '.' SYMBOL {}
;
compound_symbol_element
: compound_name {}
| func_call {}
;
compound_symbol
: compound_symbol_element {}
| compound_symbol '.' compound_symbol_element {}
;
func_call
: compound_name '(' expression_list ')' {}
;
formatted_string
: STRCONST {}
| STRCONST '(' expression_list ')' {}
;
type_specifier
: STRING {}
| FLOAT {}
| INT {}
| UINT {}
| BOOL {}
| NOTHING {}
;
constant
: FPCONST {}
| INTCONST {}
| UINTCONST {}
| BOOLCONST {}
| NOTHING {}
;
expression_factor
: constant { }
| compound_symbol { }
| formatted_string {}
;
expression
: expression_factor {}
| expression '+' expression {}
| expression '-' expression {}
| expression '*' expression {}
| expression '/' expression {}
| expression '%' expression {}
| expression EQ expression {}
| expression NEQ expression {}
| expression LT expression {}
| expression GT expression {}
| expression LTE expression {}
| expression GTE expression {}
| expression AND expression {}
| expression OR expression {}
| '-' expression %prec NEG {}
| NOT expression { }
| type_specifier ':' SYMBOL {} // type cast
| '(' expression ')' {}
;
expression_list
: expression {}
| expression_list ',' expression {}
;
%%
This is a very stripped-down parser. The "real" one is about 600 lines. It has no conflicts (and passes a bunch of tests) if I don't try to use a function call in a variable name. I am looking at rewriting it as a packrat grammar if I cannot get Bison to do what I want. The rest of the project is here: https://github.com/chucktilbury/nop
$ bison -tvdo temp.c temp.y
temp.y: warning: 4 shift/reduce conflicts [-Wconflicts-sr]
temp.y: warning: 16 reduce/reduce conflicts [-Wconflicts-rr]
All of the reduce/reduce conflicts are the result of:
module_element
: expression
| compound_symbol
That creates an ambiguity because you also have
expression
: expression_factor
expression_factor
: compound_symbol
So after seeing a compound_symbol, the parser can't tell whether to reduce it directly to module_element or to reduce it through the unit productions expression_factor and expression first. Eliminating module_element: compound_symbol doesn't change the set of sentences which can be produced; it just requires that a compound_symbol be reduced through expression before becoming a module_element.
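As a minimal sketch of that change (the grammar below is cut down to the unit productions involved, and INTCONST merely stands in for the other expression factors), module_element keeps only the expression alternative:

```yacc
/* Sketch only: a cut-down version of the question's grammar. */
%token SYMBOL INTCONST
%%
module_element
    : expression          /* the compound_symbol alternative is gone */
    ;
expression
    : expression_factor
    ;
expression_factor
    : compound_symbol     /* a bare compound_symbol still arrives here */
    | INTCONST
    ;
compound_symbol
    : SYMBOL
    ;
```

Adding `| compound_symbol` back to module_element should bring the reduce/reduce conflict back, because a lone SYMBOL could then become a module_element along two different paths.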
As Chris Dodd points out in a comment, the fact that two module_elements can appear consecutively without a delimiter creates an additional ambiguity: the grammar allows a - b to be parsed either as a single expression (and consequently a single module_element) or as two consecutive expressions, a and -b, and thus two consecutive module_elements. That ambiguity accounts for three of the four shift/reduce conflicts.
Both of these are probably errors introduced when you simplified the grammar, since it appears that module elements in the full grammar are definitions, not expressions. Removing modules altogether and using expression as the starting symbol leaves only a single conflict.
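If module elements really could be bare expressions, one common way to remove the juxtaposition ambiguity is a terminator. The following is only a sketch: the ';' terminator is my assumption and not part of NOP, and the expression rule is reduced to the minimum needed to show the a - b case.

```yacc
/* Sketch only: ';' is a hypothetical terminator, not part of NOP. */
%token SYMBOL
%left '-'
%left NEG
%%
module_list
    : module_element
    | module_list module_element
    ;
module_element
    : expression ';'      /* the terminator ends each element explicitly */
    ;
expression
    : SYMBOL
    | expression '-' expression
    | '-' expression %prec NEG
    ;
```

With the terminator in place, a '-' seen after an expression can only continue that expression, so `a - b;` is one element and `a; -b;` is two.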
That conflict is indeed the result of an ambiguity between compound_symbol and compound_name, as noted in your question. The problem is seen in these productions (non-terminals shortened to make typing easier):
name
: SYMBOL
| name '.' SYMBOL
symbol
: element
| symbol '.' element
element
: name
That means that both a and a.b are names and hence elements. But a symbol is a .-separated list of elements, so a.b could be derived in two ways:
symbol → element            symbol → symbol . element
       → name                      → element . element
       → a.b                       → name . element
                                   → a . element
                                   → a . name
                                   → a . b
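To see that conflict in isolation, the shortened productions can be put in a file of their own and fed to Bison. This is just a reproduction sketch; the file name repro.y is hypothetical.

```yacc
/* repro.y (hypothetical file name): the three shortened productions alone. */
%token SYMBOL
%%
symbol
    : element
    | symbol '.' element
    ;
element
    : name
    ;
name
    : SYMBOL
    | name '.' SYMBOL     /* after a name, '.' may extend the name or start a new element */
    ;
```

Running `bison -v repro.y` should report a shift/reduce conflict on '.', and the generated repro.output file shows the state in which the parser can either reduce name to element or shift the '.' to continue the name.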
I fixed this by simplifying the grammar to:
compound_symbol
: compound_name
| compound_name '(' expression_list ')'
compound_name
: SYMBOL
| compound_symbol '.' SYMBOL
That gets rid of func_call and compound_symbol_element, which as far as I can see serve no purpose. I don't know if the non-terminal names remaining really capture anything sensible; I think it would make more sense to call compound_symbol something like name_or_call.
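For what it's worth, here is the same pair of productions with that renaming applied, as a standalone sketch. The expression_list rule is only a stand-in so that the file is complete; in the real grammar it would be a list of full expressions.

```yacc
/* Sketch only: compound_symbol renamed to name_or_call. */
%token SYMBOL
%%
name_or_call
    : compound_name
    | compound_name '(' expression_list ')'
    ;
compound_name
    : SYMBOL
    | name_or_call '.' SYMBOL   /* a member can follow a name or a call */
    ;
expression_list                 /* stand-in: arguments restricted to names and calls */
    : name_or_call
    | expression_list ',' name_or_call
    ;
```

Here name1.name2 and name1.func1(x).name2 each have only one derivation, because a '.' always extends the single left-recursive chain rather than choosing between two lists.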
This grammar could be simplified further if higher-order functions were possible; the existing grammar forbids hof()(), presumably because you don't contemplate allowing a function to return a function object.
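If you did want to allow that, the simplification I have in mind would look roughly like the sketch below. The non-terminal name postfix is my own invention, and expression_list is again only a stand-in.

```yacc
/* Sketch only: a single left-recursive rule for names, member access and calls. */
%token SYMBOL
%%
postfix
    : SYMBOL
    | postfix '.' SYMBOL
    | postfix '(' expression_list ')'   /* the result of a call can itself be called */
    ;
expression_list                         /* stand-in for a list of full expressions */
    : postfix
    | expression_list ',' postfix
    ;
```

As in your grammar, the argument list here cannot be empty, so the accepted form is hof(a)(b); allowing hof()() would need an additional alternative for empty parentheses.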
But even with higher-order functions, you might want to differentiate between function calls and member access/array subscript, because in many languages a function cannot return an object reference and hence a function call cannot appear on the left-hand side of an assignment operator. In other languages, such as C, the requirement that the left-hand side of an assignment operator be a reference ("lvalue") is enforced outside of the grammar. (And in C++, a function call or even an overloaded operator can return a reference, so the restriction needs to be enforced after type analysis.)
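If you wanted to enforce that restriction in the grammar itself, one possible arrangement is sketched below. Everything except the SYMBOL, INTCONST and ADD_ASSIGN tokens is an assumption on my part: the statement and lvalue non-terminals, the ';' terminator, and the reduced expression rule are only there to make the sketch complete.

```yacc
/* Sketch only: an assignment target must end in a name, never in a call. */
%token SYMBOL INTCONST ADD_ASSIGN
%%
statement
    : lvalue ADD_ASSIGN expression ';'
    | expression ';'
    ;
lvalue
    : SYMBOL
    | postfix '.' SYMBOL                /* member access is assignable */
    ;
postfix
    : lvalue
    | postfix '(' expression_list ')'   /* a call is a postfix but not an lvalue */
    ;
expression
    : postfix
    | INTCONST
    ;
expression_list
    : expression
    | expression_list ',' expression
    ;
```

With this arrangement `f(x).y += 1;` is accepted while `f(x) += 1;` is a syntax error. The cost is a less helpful error message than a semantic check could give, which is one reason languages like C enforce the lvalue rule outside the grammar.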