Grammar matching regex character classes, trailing dash

I'm writing my own LALR(1) parser-generator so I'm not sure if I have an issue with my parser-generator or my grammar.

I'm trying to generate a parser for regexes. I have the following rules for character classes (slightly simplified):

LBRACKET: "[" // the [ character
RBRACKET: "]"
DASH: "-"
CHAR: [^[\]] // everything except square brackets

class ::= LBRACKET class_contents RBRACKET

class_contents ::= class_element | class_element class_contents

class_element ::= literal | literal DASH literal

literal ::= DASH | CHAR

I can match regexes such as [a-bc-d], but I cannot match [a-bc-de-] which should correspond to a rule matching the characters a, b, c, d, e, -.

It seems that upon seeing the tokens e (type literal) and - (type DASH), the parser tries to match the rule literal DASH literal. After seeing ] (type RBRACKET), it needs to realize it started the wrong production.

Is this a case where the parser needs 2 lookahead tokens, so LALR(1) is insufficient? In this case, is there a way to rewrite the grammar so that it works? Or is this grammar valid for matching [a-bc-de-] and I should look for a bug in my parser-generator?

Solution

As has been pointed out, your grammar is ambiguous. While it is sometimes possible to resolve ambiguities (which show up as shift/reduce conflicts) using standard heuristics -- "prefer shift to reduce", for example -- this technique is not entirely general, and it is not always easy to understand the consequences of the resolution.

Practical LALR generators do have resolution algorithms, typically based on operator precedence declarations with a fallback to the default algorithms (prefer shift, if there isn't a shift, prefer the first reduction in the grammar). These technique can simplify grammar writing, and sometimes make the grammar more readable and faster to parse. But once you get out of the comfort zone for automatic parser conflict resolution, the shine begins to wear off a bit.

It is not difficult to create an unambiguous grammar, particularly if you start with a precise definition of what allowable sentences are. Here, I'm going to use a simplified version of the Posix standard for regex character classes, in order to focus on the precise issue of allowing the trailing dash as a character. Removed from the standard:

collation elements ([.ä.])
equivalence classes ([=œ=])
standard named character classes ([:alpha:])
negated classes ([^…])

According to Posix, a - in a character class is treated as though it were an ordinary character "if it occurs first (after an initial ^, if any) or last in the list, or as an ending range point in a range expression." (Also, empty character classes are not allowed; if a ] is the first character in the class, it too is treated as an ordinary character.) Below, I don't (yet) implement the first of these three criteria (first in the list), concentrating on the other two. That allows for a very simple grammar, similar to the one provided by Michael Dyck:

/* I use bison syntax in order to be able to easily test the grammars */
class   : '[' contents ']'   /* See Note 1 */
contents: '-'
        | unit
        | unit contents
unit    : CHAR
        | CHAR '-' CHAR      /* See Note 2 */
        | CHAR '-' '-'

As with Michael's grammar, this grammar is unambiguous but LALR(2), which makes it theoretically interesting but practically almost useless since there are no commonly available LALR(2) parser generators. You could parse it with a GLR or Early parser, but it is also possible to mechanically transform an (LA)LR(k) grammar into an (LA)LR(1) grammar [Note 3]. (Michael alludes to this construction, too.)

I've mentioned this fact in a number of SO answers, but this grammar is actually simple enough to make it possible to do the transform by hand, which might make it slightly easier to understand.

The transformation is simple enough. To reduce LR(k) to LR(1), we simply shift every reduction to the right by k-1 tokens. To do that, for each grammar symbol V (both terminals and non-terminals) we create all possible "delayed" grammar symbols. Each such symbol has the form _firstV_follow, where first and follow are sequences of length k-1 from the FIRST_k−1 and FOLLOW_k−1 sets of V [Note 3]. The new symbol _firstV_follow represents an instance of V which appeared k−1 tokens earlier in the input stream, but which can be reduced at this point because there is now enough information to decide.

Obviously, this represents an enormous blow-up in the size of the grammar, although it is more or less manageable for simple grammars with k = 2. In practice, it would be just as manageable to construct the (LA)LR(k) parser. Also, the transformed grammar is far from readable, which is a key feature of grammar-based parser generation. (That wouldn't matter if the transformation were done in the background by a computer program, of course.) But the construction does serve as a proof that every (LA)LR(k) grammar has an equivalent (LA)LR(1) grammar.

The full construction also shows how to undo the transformation during the construction of the parse tree. I haven't seen a description of how to transform semantic actions, but it's not very difficult with yacc/bison. What's needed is to give every (transformed) symbol two attributes (or, in bisonic terms, a struct consisting of two values): one represents the semantic value of the (delayed) symbol being reduced, and the other represents the semantic value of the token which was just shifted.

In a symbol of the form _firstV_follow, the reduce value is the semantic value of V, while the delayed token value is the semantic value of the last token in follow. Yacc/bison implement a rarely-used extension to the $i syntax which allows the semantic action to refer to semantic values which occur earlier in the value stack by using values of i less than 1. Using this mechanism, the token value corresponding to the symbol $i will be found at $(i − (k−1)). (Since i must be a constant, you have to do the subtraction yourself when writing the rule.)

In the example that follows, I don't use reduction values at all; instead, the reduction just prints the value of the reduction. Semantic value references like $0 are the result of applying the above formula. (In this case, k-1 is 1, so $0 refers to the token value of the symbol at position 1 in the right-hand side.)

With that, here's a complete program which you can use to test the grammar:

$ cat charclass.y
%token CHAR
%code {
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int yylex(void) {
  int c;
  do c = getchar(); while (c != '\n' && isspace(c));
  yylval = c;
  switch (c) {
    case EOF: return 0;
    case '\n': case '-': case '[': case ']': return c;
    default: return CHAR;
  }
}
void yyerror(const char* msg) {
  fprintf(stderr, "%s\n", msg);
}
}

%%
input: %empty
     | input class '\n'
     | input '\n'
     | error '\n' { yyerrok; }

/* Original untransformed grammar, for reference */
class: '[' contents ']'
contents: '-' | unit | unit contents
unit    : CHAR | CHAR '-' CHAR | CHAR '-' '-'
*/

class              : '[' OPEN-class-epsi                          { fputc('\n', stderr); }
OPEN-class-epsi    : OPEN-OPEN-DASH DASH-contents-CLOS CLOS-CLOS-epsi
                   | OPEN-OPEN-CHAR CHAR-contents-CLOS CLOS-CLOS-epsi
DASH-contents-CLOS : DASH-DASH-CLOS                               { fprintf(stderr, "CHR(%c) ", $0); }
CHAR-contents-CLOS : CHAR-unit-CLOS
                   | CHAR-unit-DASH DASH-contents-CLOS
                   | CHAR-unit-CHAR CHAR-contents-CLOS
CHAR-unit-CLOS     : CHAR-CHAR-CLOS                               { fprintf(stderr, "CHR(%c) ", $0); }
                   | CHAR-CHAR-DASH DASH-DASH-CHAR CHAR-CHAR-CLOS { fprintf(stderr, "RNG(%c-%c) ", $0, $2); }
                   | CHAR-CHAR-DASH DASH-DASH-DASH DASH-DASH-CLOS { fprintf(stderr, "RNG(%c-%c) ", $0, $2); }
CHAR-unit-DASH     : CHAR-CHAR-DASH                               { $$ = $1; fprintf(stderr, "CHR(%c) ", $0); }
                   | CHAR-CHAR-DASH DASH-DASH-CHAR CHAR-CHAR-DASH { $$ = $3; fprintf(stderr, "RNG(%c-%c) ", $0, $2); }
                   | CHAR-CHAR-DASH DASH-DASH-DASH DASH-DASH-DASH { $$ = $3; fprintf(stderr, "RNG(%c-%c) ", $0, $2); }
CHAR-unit-CHAR     : CHAR-CHAR-CHAR                               { $$ = $1; fprintf(stderr, "CHR(%c) ", $0); }
                   | CHAR-CHAR-DASH DASH-DASH-CHAR CHAR-CHAR-CHAR { $$ = $3; fprintf(stderr, "RNG(%c-%c) ", $0, $2); }
                   | CHAR-CHAR-DASH DASH-DASH-DASH DASH-DASH-CHAR { $$ = $3; fprintf(stderr, "RNG(%c-%c) ", $0, $2); }
CLOS-CLOS-epsi     : %empty
CHAR-CHAR-CHAR     : CHAR
CHAR-CHAR-CLOS     : ']'
CHAR-CHAR-DASH     : '-'
DASH-DASH-CHAR     : CHAR
DASH-DASH-CLOS     : ']'
DASH-DASH-DASH     : '-'
OPEN-OPEN-DASH     : '-'
OPEN-OPEN-CHAR     : CHAR

%%
int main(int argc, char** argv) {
#if YYDEBUG
  if (argc > 1 && strcmp(argv[1], "-d") == 0) yydebug = 1;
#endif
  return yyparse();
}

Here's a short sample run:

$ bison -t -o charclass.c charclass.y && gcc -Wall -std=c11 -o charclass -ggdb charclass.c
$ ./charclass
[abc]
CHR(a) CHR(b) CHR(c) 
[a-bc]
RNG(a-b) CHR(c) 
[ab-c]
CHR(a) RNG(b-c) 
[ab-]
CHR(a) CHR(b) CHR(-) 
[a-b-]
RNG(a-b) CHR(-) 
[a--]
RNG(a--) 
[a---]
RNG(a--) CHR(-) 
[a-b-c]
RNG(a-b) syntax error

Studying the bison trace might or might not help. FWIW, here's one example:

$ ./charclass -d <<< '[a-b-]'
Starting parse
Entering state 0
Reading a token: Next token is token '[' ()
Reducing stack by rule 1 (line 23):
-> $$ = nterm input ()
Stack now 0
Entering state 2
Next token is token '[' ()
Shifting token '[' ()
Entering state 6
Reading a token: Next token is token CHAR ()
Shifting token CHAR ()
Entering state 8
Reducing stack by rule 29 (line 58):
   $1 = token CHAR ()
-> $$ = nterm OPEN-OPEN-CHAR ()
Stack now 0 2 6
Entering state 12
Reading a token: Next token is token '-' ()
Shifting token '-' ()
Entering state 19
Reducing stack by rule 24 (line 53):
   $1 = token '-' ()
-> $$ = nterm CHAR-CHAR-DASH ()
Stack now 0 2 6 12
Entering state 26
Reading a token: Next token is token CHAR ()
Shifting token CHAR ()
Entering state 31
Reducing stack by rule 25 (line 54):
   $1 = token CHAR ()
-> $$ = nterm DASH-DASH-CHAR ()
Stack now 0 2 6 12 26
Entering state 33
Reading a token: Next token is token '-' ()
Shifting token '-' ()
Entering state 19
Reducing stack by rule 24 (line 53):
   $1 = token '-' ()
-> $$ = nterm CHAR-CHAR-DASH ()
Stack now 0 2 6 12 26 33
Entering state 37
Reducing stack by rule 16 (line 45):
   $1 = nterm CHAR-CHAR-DASH ()
   $2 = nterm DASH-DASH-CHAR ()
   $3 = nterm CHAR-CHAR-DASH ()
RNG(a-b) -> $$ = nterm CHAR-unit-DASH ()
Stack now 0 2 6 12
Entering state 22
Reading a token: Next token is token ']' ()
Shifting token ']' ()
Entering state 14
Reducing stack by rule 26 (line 55):
   $1 = token ']' ()
-> $$ = nterm DASH-DASH-CLOS ()
Stack now 0 2 6 12 22
Entering state 16
Reducing stack by rule 8 (line 37):
   $1 = nterm DASH-DASH-CLOS ()
CHR(-) -> $$ = nterm DASH-contents-CLOS ()
Stack now 0 2 6 12 22
Entering state 29
Reducing stack by rule 10 (line 39):
   $1 = nterm CHAR-unit-DASH ()
   $2 = nterm DASH-contents-CLOS ()
-> $$ = nterm CHAR-contents-CLOS ()
Stack now 0 2 6 12
Entering state 20
Reducing stack by rule 21 (line 50):
-> $$ = nterm CLOS-CLOS-epsi ()
Stack now 0 2 6 12 20
Entering state 28
Reducing stack by rule 7 (line 36):
   $1 = nterm OPEN-OPEN-CHAR ()
   $2 = nterm CHAR-contents-CLOS ()
   $3 = nterm CLOS-CLOS-epsi ()
-> $$ = nterm OPEN-class-epsi ()
Stack now 0 2 6
Entering state 10
Reducing stack by rule 5 (line 34):
   $1 = token '[' ()
   $2 = nterm OPEN-class-epsi ()

-> $$ = nterm class ()
Stack now 0 2
Entering state 7
Reading a token: Next token is token '\n' ()
Shifting token '\n' ()
Entering state 13
Reducing stack by rule 2 (line 24):
   $1 = nterm input ()
   $2 = nterm class ()
   $3 = token '\n' ()
-> $$ = nterm input ()
Stack now 0
Entering state 2
Reading a token: Now at end of input.
Shifting token $end ()
Entering state 4
Stack now 0 2 4
Cleanup: popping token $end ()
Cleanup: popping nterm input ()

I have relied on the description of the algorithm in section 6.7 of Parsing Theory by S. Sippu and E: Soisalon-Soininen (Springer-Verlag, 1988) which should be available in any good academic library.

Notes:

bison, like many parser generators, lets you write single-character tokens with single quotes, which makes the grammar a bit more readable, IMHO.
In order to simplify the following step, I avoided defining

any: CHAR | '-'

which could have been used to allow - as the "ending range point" (as in %--). (unit: CHAR '-' any). Instead, I wrote two unit rules, effectively expanding the any production above.
The transformation described below transforms an LR(k) grammar into an LR(1) grammar, or an LALR(k) grammar into an LALR(1) grammar. I use (LA)LR(k) as short-hand to represent these two cases; that's imprecise because the transformation also transforms an SLR(k) grammar into an SLR(1) grammar.
In the example here k-1 is 1 and there are no epsilon productions, so we can simply use the FIRST set for the grammar symbol. But in the general case, it is possible that a given symbol V has a derivation which is shorter than k-1. A more precise formulation is that follow is an element in FOLLOW_k−1(V) and first is an element in FIRST_k−1(V # follow), using # as the concatenation operator.