parsing grammar bison jison

Jison: Distinguishing between digits and numbers


I have the following minimal example of a grammar I'd like to use with Jison.

/* lexical grammar */
%lex
%%

\s+                   /* skip whitespace */
[0-9]+("."[0-9]+)?\b  return 'NUMBER'
[0-9]                 return 'DIGIT'
[,-]                  return 'SEPARATOR'

// EOF means "end of file"
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%start expressions

%% /* language grammar */

expressions
    : e SEPARATOR d EOF
        {return $1;}
    ;

d
    : DIGIT
        {$$ = Number(yytext);}
    ;

e
    : NUMBER
        {$$ = Number(yytext);}
    ;

Here I have defined both NUMBER and DIGIT in order to allow for both digits and numbers, depending on the context. What I do not know is how to define the context. The above example always returns

Expecting 'DIGIT', got 'NUMBER'

when I try to run it in the Jison debugger. How can I define the grammar so that it always expects a digit after a separator? I tried the following, which does not work either:

/* lexical grammar */
%lex
%%

\s+                   /* skip whitespace */
[,-]                  return 'SEPARATOR'

// EOF means "end of file"
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%start expressions

%% /* language grammar */

expressions
    : e SEPARATOR d EOF
        {return $1;}
    ;

d
    : [0-9]
        {$$ = Number(yytext);}
    ;

e
    : [0-9]+("."[0-9]+)?\b
        {$$ = Number(yytext);}
    ;

Solution

    The classic scanner/parser model (originally from lex/yacc, and implemented by jison as well) puts the scanner before the parser. In other words, the scanner is expected to tokenize the input stream without regard to parsing context.
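
To see why the parser is handed NUMBER where the grammar wants DIGIT, it may help to imitate the scanner by hand. The following is a minimal JavaScript sketch, not Jison's actual implementation: the `rules` table and the `tokenize` function are made up for illustration, and it assumes the scanner tries the rules in order and takes the first one that matches.

```javascript
// Lexer rules from the question, in their original order.
// Assumption: the first matching rule wins.
const rules = [
  [/^\s+/, null],                        // skip whitespace
  [/^[0-9]+(\.[0-9]+)?\b/, 'NUMBER'],
  [/^[0-9]/, 'DIGIT'],
  [/^[,-]/, 'SEPARATOR'],
  [/^./, 'INVALID'],
];

// Hypothetical stand-in for the generated scanner: it never looks at
// what the parser expects, only at the remaining input.
function tokenize(input) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    for (const [pattern, name] of rules) {
      const match = pattern.exec(rest);
      if (match) {
        if (name !== null) tokens.push(name);
        rest = rest.slice(match[0].length);
        break;
      }
    }
  }
  tokens.push('EOF');
  return tokens;
}

console.log(tokenize('12.5 - 3').join(' '));
// prints: NUMBER SEPARATOR NUMBER EOF
```

Because the NUMBER pattern also matches a single digit and is listed first, the scanner never produces DIGIT for this input, whatever the parser happens to be expecting.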

    Most lexical scanner generators, including jison, provide a mechanism for the scanner to adapt to context (see "start conditions"), but the scanner is responsible for tracking context on its own, and that gets quite ugly.
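
As a rough illustration of what context tracking inside the scanner involves, here is a hand-written JavaScript sketch with two states. It does not use Jison's start-condition syntax; the state names, the `rulesByState` table and the `tokenizeWithStates` function are all hypothetical.

```javascript
// One rule table per scanner state (hypothetical names).
const rulesByState = {
  INITIAL: [
    [/^\s+/, null],                      // skip whitespace
    [/^[0-9]+(\.[0-9]+)?\b/, 'NUMBER'],
    [/^[,-]/, 'SEPARATOR'],
  ],
  afterSeparator: [
    [/^\s+/, null],                      // skip whitespace
    [/^[0-9]\b/, 'DIGIT'],               // only a lone digit is valid here
  ],
};

function tokenizeWithStates(input) {
  const tokens = [];
  let state = 'INITIAL';
  let rest = input;
  while (rest.length > 0) {
    let matched = false;
    for (const [pattern, name] of rulesByState[state]) {
      const match = pattern.exec(rest);
      if (!match) continue;
      matched = true;
      rest = rest.slice(match[0].length);
      if (name !== null) {
        tokens.push(name);
        // The scanner itself has to remember that a separator was seen.
        state = name === 'SEPARATOR' ? 'afterSeparator' : 'INITIAL';
      }
      break;
    }
    if (!matched) {
      // Nothing in the current state matches: report one bad character.
      tokens.push('INVALID');
      rest = rest.slice(1);
    }
  }
  tokens.push('EOF');
  return tokens;
}

console.log(tokenizeWithStates('12.5 , 3').join(' '));
// prints: NUMBER SEPARATOR DIGIT EOF
```

Even in this tiny case the scanner has to duplicate knowledge that already lives in the grammar (a digit follows a separator), which is the kind of ugliness this approach tends to bring.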

    The simplest solution in this case is to define only a NUMBER token, and have the parser check for validity in the semantic action of rules which actually require a DIGIT. That will work because the difference between DIGIT and NUMBER does not affect the parse other than to make some parses illegal. It would be different if the difference between NUMBER and DIGIT determined which production to use, but that would probably be ambiguous since all digits are actually numbers as well.
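
A minimal sketch of such a validity check, written as a plain JavaScript helper. The name `toDigit` is hypothetical and not part of Jison; the helper would also have to be made visible to the generated parser (for instance through a code section in the grammar file, which is an assumption about the setup rather than something shown in the question).

```javascript
// Hypothetical helper for the semantic action of rule d.
// The lexer only knows NUMBER; this rejects anything that is not one digit.
function toDigit(text) {
  if (!/^[0-9]$/.test(text)) {
    throw new Error("Expecting a single digit, got '" + text + "'");
  }
  return Number(text);
}

// Intended use inside the grammar (shown as a comment, not run here):
//
// d
//     : NUMBER
//         {$$ = toDigit($1);}
//     ;

console.log(toDigit('7'));
// prints: 7
```

With this arrangement the DIGIT rule can be dropped from the lexer entirely, and an input such as `12.5 - 34` fails in the action for d rather than in the scanner.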

    Another solution is to accept either NUMBER or DIGIT wherever a number is allowed. That requires changing e so that it accepts either token, and ensuring that DIGIT wins whenever both DIGIT and NUMBER could match. To do that, put the DIGIT rule before the NUMBER rule in the lexical grammar and add \b to the end of its pattern:

    [0-9]\b               return 'DIGIT'
    [0-9]+("."[0-9]+)?\b  return 'NUMBER'
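
The effect of that ordering can be checked with a small hand-written scanner. This is a JavaScript sketch under the assumption that the first matching rule wins; `digitFirstRules` and `tokenizeDigitFirst` are illustrative names, not Jison APIs.

```javascript
// Same rules as in the question, but with DIGIT (plus \b) listed first.
const digitFirstRules = [
  [/^\s+/, null],                        // skip whitespace
  [/^[0-9]\b/, 'DIGIT'],
  [/^[0-9]+(\.[0-9]+)?\b/, 'NUMBER'],
  [/^[,-]/, 'SEPARATOR'],
  [/^./, 'INVALID'],
];

function tokenizeDigitFirst(input) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    for (const [pattern, name] of digitFirstRules) {
      const match = pattern.exec(rest);
      if (match) {
        if (name !== null) tokens.push(name);
        rest = rest.slice(match[0].length);
        break;
      }
    }
  }
  tokens.push('EOF');
  return tokens;
}

console.log(tokenizeDigitFirst('12.5 - 3').join(' '));
// prints: NUMBER SEPARATOR DIGIT EOF
console.log(tokenizeDigitFirst('7 - 3').join(' '));
// prints: DIGIT SEPARATOR DIGIT EOF
console.log(tokenizeDigitFirst('5.3 - 3').join(' '));
// prints: DIGIT INVALID DIGIT SEPARATOR DIGIT EOF
```

The second line shows why e has to accept both tokens: a lone digit now always arrives as DIGIT. The third line shows something to watch for under the first-match assumption: \b also matches between a digit and a decimal point, so a number such as 5.3 is split up, and the DIGIT pattern would probably need tightening to exclude a following fraction.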