Yacc/Bison for Verilog hierarchical and arrayed identifiers

I'm trying to parse identifiers in the Verilog language. The full grammar is here.

They can have the forms below:

name
name[index]
name[start:stop]
name[index][start:stop]
name.(any of the above)
name[index].(any of the above)
name[index].name[index]... .name[index][start:stop]

Or in EMBF format:

(name ([index])?)+ ([start:stop])?

Here name is a typical identifier as in most programming languages, while index, start and stop are integer.

I'm new to yacc (I'm actually using Jison) but I'm not sure that this can be correctly interpreted at all with the single lookahead token limitation. If name and [ are in the stack, it cannot tell the difference between index and start.

This is my grammar so far:

primary 
: number
| hierarchical_identifier bracketted_range_expression
| hierarchical_identifier
;

primary 
: number
| hierarchical_identifier
| hierarchical_identifier bracketted_range_expression
;

hierarchical_identifier
: IDENTIFIER
| IDENTIFIER '[' UNSIGNED_NUMBER ']'
| hierarchical_identifier '.' IDENTIFIER
| hierarchical_identifier '.' IDENTIFIER '[' UNSIGNED_NUMBER ']'
;

bracketted_range_expression
: '[' range_expression ']';

range_expression
: UNSIGNED_NUMBER ':' UNSIGNED_NUMBER

Which yields several shift/reduce and reduce/reduce errors, and it simply does not want to parse something line foo[1:0]. It expects a ] instead fo the :. Any way to solve this?

Solution

Your analysis is correct. With your grammar, hierarchical_identifier must be reduced before the parser can start scanning bracketed_range_expression, but it cannot know whether the [ following an IDENTIFIER starts a bracketed_range_expression (in which case the reduction is necessary) or is the [ in '[' UNSIGNED_NUMBER ']' (in which case it should be shifted).

With bison, we could solve this using a GLR parser, but with jison we're limited to LALR(1). Fortunately, the language is still LALR(1); all that is needed is to defer the shift/reduce decision by expanding some nonterminals and reducing them later.

Unfortunately, the result is a bit ugly because it requires a certain amount of duplication. Because we need to always be able to shift the [, we end up "denormalizing" the grammar (to borrow a term from database design).

Here's one solution, tested with the jison on-line tool. (The actions were only intended to show that the grammar attaches the range to the entire hierarchical list, and not just to the last identifier in the list.)

/* lexical grammar */
%lex
%%

\s+                   /* skip whitespace */
[0-9]+                return 'NUMBER'
[a-zA-Z][a-zA-Z0-9]*  return 'IDENTIFIER'
.                     return yytext[0]
<<EOF>>               return 'EOF'

/lex

%start expr

%% /* language grammar */

expr   : primary EOF { return $1; }
       ;

primary: NUMBER
       | hierarchical_identifier
       | hierarchical_identifier_with_range
       ;

indexed_identifier
       : IDENTIFIER '[' NUMBER ']' { $$ = { "id": $1, "idx": $3}; } ;

postfix_range
       : '[' NUMBER ':' NUMBER ']' { $$ = [ $2, $4 ]; } ;

hierarchical_identifier
       : IDENTIFIER         { $$ = []; $$.push({ "id": $1}); }
       | indexed_identifier { $$ = [ $1 ]; }
       | hierarchical_identifier '.' IDENTIFIER
                            { $$ = $1; $$.push({ "id": $3}); }
       | hierarchical_identifier '.' indexed_identifier
                            { $$ = $1; $$.push($3); }
       ;

hierarchical_identifier_with_range
       : IDENTIFIER postfix_range
                            { $$ = { "idlist": [{"id": $1}],
                                     "lo": $2[0], "hi": $2[1]}; }
       | indexed_identifier postfix_range
                            { $$ = { "idlist": [$1],
                                     "lo": $2[0], "hi": $2[1]}; }
       | hierarchical_identifier '.' IDENTIFIER postfix_range
                            { $1.push({"id": $3});
                              $$ = { "idlist": $1,
                                     "lo": $4[0], "hi": $4[1]}; }
       | hierarchical_identifier '.' indexed_identifier postfix_range
                            { $1.push($3);
                              $$ = { "idlist": $1,
                                     "lo": $4[0], "hi": $4[1]}; }
       ;

If you eventually plan to use bison, you will probably find that a GLR parser is easier, without adding too much parsing overhead (since your grammar really is unambiguous).