I need help defining some rules for a grammar in cups. The rules in question belong to the declaration block, which consists of the declaration of 0 or more constants, 0 or more type records, and 0 or more variables. An example of code to parser:
x: constant := True;
y: constant := 32
type Tpersona is record
dni: Integer;
edad : Integer;
casado : Boolean;
end record;
type Tfecha is record
dia: Integer;
mes : Integer;
anyo : Integer;
end record;
type Tcita is record
usuario:Tpersona;
fecha:Tfecha;
end record;
a: Integer;
x,y: Boolean;
x,y: Boolean;
x,y: Boolean;
The order between them must be respected, but any of them can not appear. This last property is what generates a shift/reduce conflict with the following rules.
declaration_block ::= const_block types_block var_block;
// Constant declaration
const_block ::= dec_const const_block | ;
dec_const ::= IDEN TWOPOINT CONSTANT ASSIGN const_values SEMICOLON;
//Types declaration
types_block ::= dec_type types_block | ;
dec_type ::= TYPE IDEN IS RECORD
reg_list
END RECORD SEMICOLON;
reg_list ::= dec_reg reg_list | dec_reg;
dec_reg ::= IDEN TWOPOINT valid_types SEMICOLON;
//Variable declaration
var_block ::= dec_var var_block | ;
dec_variable ::= iden_list TWOPOINT valid_types SEMICOLON;
iden_list ::= IDEN | IDEN COMMA iden_list;
// common use
const_values ::= INT | booleans;
booleans ::= TRUE | FALSE;
valid_types ::= primitive_types | IDEN;
primitive_types ::= INTEGER | BOOLEAN;
The idea is that any X_block can be empty. I understand the shift-reduce conflict, since when starting and receiving an identifier (IDEN), it doesn't know whether to reduce in const_block ::= <empty>
and take IDEN as part of dec_variable
, or to shift and take the IDEN
token as part of const_block
. If I remove the empty/epsilon production in const_block
or in type_block
, the conflict disappears, although the grammar would be incorrect because it would be an infinite list of constants and it would give a syntax error in the reserved word "type".
So I may have an ambiguity caused because both constants and variables can go at the beginning and start with "id:" and either block can appear first. How can I rewrite the rules to resolve the ambiguities and the shift/reduce conflict they cause?
I tried to do something like:
declaration_block ::= const_block types_block var_block | const_block types_block | const_block var_block | types_block var_block | types_block | var_decl | ;
but i have the same problem.
Other try is to create new_rules to identify if it is a constant or a variable... but the ambiguety of the empty rule in contant_block do not dissapear.
dec_const ::= start_const ASSIGN valor_constantes SEMICOLON;
start_const ::= IDEN TWOPOINT CONSTANT;
// dec_var ::= start_variables SEMICOLON;
// start_var ::= lista_iden TWOPOINT tipos_validos;
If I reduce the problem to something simpler, without taking into account types and only allowing one declaration of a constant or a variable, the fact that these blocks can be empty produces the problem:
dec_var ::= iden_list TWOPOINT valid_types SEMICOLON | ;
iden_list ::= IDEN | IDEN COMMA lista_iden;
I expect rewrite the rules some way to solve this conflict and dealing with similar problemns in the future.
Thanks so much
To start with, your grammar is not ambiguous. But it does have a shift-reduce conflict (in fact, two of them), which indicates that it cannot be parsed deterministically with only one lookahead token.
As it happens, you could solve the problem (more or less) by just increasing the lookahead, if you had a parser generator which allowed you to do that. However, such parser generators are pretty rare, and CUP isn't one of them. There are parser generators which allow arbitrary lookahead, either by backtracking (possibly with memoisation, such as ANTLR4), or by using an algorithm which allows multiple alternatives to be explored in parallel (GLR, for example). But I don't know of a parser generators which can produce a deterministic transition table which uses two lookahead tokens (which would suffice, in this case).
So the solution is to add some apparent redundancy to the grammar in order to factor out the cases which require more than one lookahead token.
The fundamental problem is the following set of possible inputs:
...; a : constant 3 ; ...
...; a : Integer ; ...
There's no ambiguity here whatsoever. The first one can only be a constant declaration; the second can only be variable declarations. But observe that we don't discover that fact until we see either the keyword constant
(as in the first case), or a identifier which could be a type (as in the second case).
What that means is that we need to avoid forcing the parser to make any decision involving the a
and the :
until the next token is available. In particular, we cannot force it to decide whether the a
is just an IDEN
, or the first (or only) element in an iden_list
.
iden_list
is needed to parse
...; a , b : Integer ; ...
but that's not a problem since the ,
is a definite sign that we have a list. So the resolution has to include hamdling a : Integer
without reducing a
to an iden_list
. And that requires an (apparently) redundant production:
var_block::=
| dec_var var_block
dec_var : iden_list ':' type ';'
| IDEN ':' type ';'
iden_list : IDEN ',' IDEN
| iden_list ',' IDEN
(Note: I changed valid_types
to type
because valid
is redundant -- only valid syntaxes are parsed -- and because I think you should never use a plural name for a singular object; it confuses the reader.)
That's not quite enough, though, because we also need to avoid forcing the parser to decide whether the const_block
needs to be reduced before the variable declaration. For that, we need something like the attempt you already made to remove the empty block definitions, and instead provide eight different declaration_block
productions, one of each of the eight possible empty clauses. That will work fine, as long as you change the block definitions to be left-recursive rather than right-recursive. The right-recursive definition forces the parser to perform a reduction at the end of const_block
, which means that it needs to know exactly where const_block
ends with only one lookahead token.
On the whole, if you're going to use a bottom-up parser like CUP, you should make it a habit to use left-recursion unless you have a good reason not to (like defining a right-associative operator). There are a few exceptions, but on the whole left-recursion will produce fewer surprises, and in addition it will not burn through the parser stack on long inputs.
Making all those changes, we end up with something like this, where:
ident_list
was forced to have at least two elements, and a "redundant" production was added for the one-identifier case;declaration_block ::=
| var_block
| types_block
| types_block var_block
| const_block
| const_block var_block
| const_block types_block
| const_block types_block var_block
;
// Constant declaration
const_block ::= dec_const
| const_block dec_const ;
dec_const ::= IDEN TWOPOINT CONSTANT ASSIGN const_value SEMICOLON;
//Types declaration
types_block ::= dec_type
| types_block dec_type ;
dec_type ::= TYPE IDEN IS RECORD
reg_list
END RECORD SEMICOLON;
reg_list ::= dec_reg
| reg_list dec_reg;
dec_reg ::= IDEN TWOPOINT type SEMICOLON;
//Variable declaration
var_block ::= dec_var
| var_block dec_var;
dec_var : iden_list ':' type ';'
| IDEN ':' type ';' ;
iden_list : IDEN ',' IDEN
| iden_list ',' IDEN;
// common use
const_value ::= INT | boolean;
boolean ::= TRUE | FALSE;
type ::= primitive_type | IDEN;
primitive_type ::= INTEGER | BOOLEAN;