I have the following minimal example of a grammar I'd like to use with Jison.
/* lexical grammar */
%lex
%%

\s+                   /* skip whitespace */
[0-9]+("."[0-9]+)?\b  return 'NUMBER'
[0-9]                 return 'DIGIT'
[,-]                  return 'SEPARATOR'
// EOF means "end of file"
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%start expressions

%% /* language grammar */

expressions
    : e SEPARATOR d EOF
        {return $1;}
    ;

d
    : DIGIT
        {$$ = Number(yytext);}
    ;

e
    : NUMBER
        {$$ = Number(yytext);}
    ;
Here I have defined both NUMBER and DIGIT in order to allow for both digits and numbers, depending on the context. What I do not know is how to define the context. The above example always returns
Expecting 'DIGIT', got 'NUMBER'
when I try to run it in the Jison debugger. How can I define the grammar so that it always expects a digit after a separator? I tried the following, which does not work either:
/* lexical grammar */
%lex
%%

\s+                   /* skip whitespace */
[,-]                  return 'SEPARATOR'
// EOF means "end of file"
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%start expressions

%% /* language grammar */

expressions
    : e SEPARATOR d EOF
        {return $1;}
    ;

d
    : [0-9]
        {$$ = Number(yytext);}
    ;

e
    : [0-9]+("."[0-9]+)?\b
        {$$ = Number(yytext);}
    ;
The classic scanner/parser model (originally from lex/yacc, and implemented by Jison as well) puts the scanner before the parser. In other words, the scanner is expected to tokenize the input stream without regard to parsing context.
Most lexical scanner generators, including Jison, provide a mechanism for the scanner to adapt to context (see "start conditions"), but the scanner is responsible for tracking context on its own, and that gets quite ugly.
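To show what "tracking context on its own" means in practice, here is a toy scanner in plain JavaScript (the language Jison actions are written in) that switches state by hand after a separator. It is only a sketch: scanWithStates and the state names INITIAL and AFTER_SEP are made up for illustration, and it does not reproduce how Jison implements start conditions.

```javascript
// Toy scanner with two states, loosely modelled on lex-style start conditions.
// scanWithStates, INITIAL and AFTER_SEP are hypothetical names.
function scanWithStates(input) {
  const rules = {
    // Default state: any number is allowed.
    INITIAL: [
      [/^\s+/, null],
      [/^[0-9]+(\.[0-9]+)?\b/, "NUMBER"],
      [/^[,-]/, "SEPARATOR"],
    ],
    // State entered after a separator: only a single digit is allowed.
    AFTER_SEP: [
      [/^\s+/, null],
      [/^[0-9]\b/, "DIGIT"],
    ],
  };
  const tokens = [];
  let state = "INITIAL";
  let rest = input;
  while (rest.length > 0) {
    let m = null;
    let name = null;
    for (const [pattern, tokenName] of rules[state]) {
      m = pattern.exec(rest);
      if (m) {
        name = tokenName;
        break;
      }
    }
    if (!m) {
      // No rule matched in this state: emit a one-character INVALID token.
      tokens.push("INVALID");
      rest = rest.slice(1);
      continue;
    }
    rest = rest.slice(m[0].length);
    if (name === null) continue; // whitespace is skipped
    tokens.push(name);
    // The scanner itself has to decide when the context changes.
    if (name === "SEPARATOR") state = "AFTER_SEP";
    else if (name === "DIGIT") state = "INITIAL";
  }
  tokens.push("EOF");
  return tokens;
}

console.log(scanWithStates("12.5 , 3").join(" "));
// NUMBER SEPARATOR DIGIT EOF
```

Even in this tiny case the state transitions duplicate knowledge that the grammar already has (a digit follows a separator), which is why this approach tends to get ugly as the language grows.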
The simplest solution in this case is to define only a NUMBER token, and have the parser check for validity in the semantic action of rules which actually require a DIGIT. That will work because the difference between DIGIT and NUMBER does not affect the parse other than to make some parses illegal. It would be different if the difference between NUMBER and DIGIT determined which production to use, but that would probably be ambiguous since all digits are actually numbers as well.
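As a rough sketch of that validity check, the semantic action for d could call a small JavaScript helper like the one below. The name toDigit is hypothetical (it is not part of Jison), and how the helper is made visible to the generated parser is left open here; the same test could just as well be written inline in the action.

```javascript
// Hypothetical helper: accepts the text of a NUMBER token and insists that
// it is exactly one decimal digit.
function toDigit(text) {
  if (!/^[0-9]$/.test(text)) {
    // Rejects inputs such as "12" or "3.5".
    throw new Error("Expected a single digit, got '" + text + "'");
  }
  return Number(text);
}

// Sketch of how the grammar rule might use it (d now takes a NUMBER token):
//   d
//       : NUMBER
//           {$$ = toDigit($1);}
//       ;

console.log(toDigit("7"));
// 7
```

With this approach the lexer only ever produces NUMBER, so the "Expecting 'DIGIT', got 'NUMBER'" conflict cannot arise; an out-of-range value is reported from the action instead.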
Another solution is to allow either NUMBER or DIGIT where a number is allowed. That would require changing e so that it accepts either NUMBER or DIGIT, and ensuring that DIGIT wins out in the case that both NUMBER and DIGIT are possible. That requires putting the DIGIT rule earlier in the lexical rules, and adding the \b at the end:
[0-9]\b return 'DIGIT'
[0-9]+("."[0-9]+)?\b return 'NUMBER'
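To see why the rule order and the \b matter, here is a toy tokenizer in plain JavaScript that tries the rules in order and takes the first one that matches. That first-match behaviour is an assumption of the sketch, and firstMatchTokens, questionOrder and digitFirst are illustrative names; it is a model of the idea, not Jison's actual lexer.

```javascript
// Toy first-match tokenizer: tries the rules in order and takes the first hit.
// An illustration under that assumption, not Jison's real implementation.
function firstMatchTokens(input, rules) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    const hit = rules
      .map(([pattern, name]) => ({ m: pattern.exec(rest), name }))
      .find(({ m }) => m !== null);
    if (!hit) {
      // Nothing matched: consume one character as INVALID.
      tokens.push("INVALID");
      rest = rest.slice(1);
      continue;
    }
    tokens.push(hit.name);
    rest = rest.slice(hit.m[0].length);
  }
  return tokens;
}

const NUMBER = [/^[0-9]+(\.[0-9]+)?\b/, "NUMBER"];
const SEPARATOR = [/^[,-]/, "SEPARATOR"];

// Order from the question: NUMBER first, DIGIT without \b.
const questionOrder = [NUMBER, [/^[0-9]/, "DIGIT"], SEPARATOR];
// Order suggested above: DIGIT (with \b) first.
const digitFirst = [[/^[0-9]\b/, "DIGIT"], NUMBER, SEPARATOR];

console.log(firstMatchTokens("12,3", questionOrder).join(" "));
// NUMBER SEPARATOR NUMBER
console.log(firstMatchTokens("12,3", digitFirst).join(" "));
// NUMBER SEPARATOR DIGIT
console.log(firstMatchTokens("3,4", digitFirst).join(" "));
// DIGIT SEPARATOR DIGIT
```

The last line shows why e has to accept DIGIT as well: a one-digit number before the separator now arrives as DIGIT. One caveat in this toy model is that \b also matches between a digit and a ".", so "3.5" comes out as DIGIT INVALID DIGIT with the DIGIT-first order. If decimals matter, it is worth checking how the real lexer treats that input and tightening the DIGIT pattern if necessary.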