Search code examples
parsingyaccflex-lexerjison

jison grammar definition leads to wrong token recognition


I recently found the project jison and modified the calculator example from its website. (http://zaach.github.io/jison/demos/calc/)

/* lexical grammar */
%lex
%%

"a"                   return 'TOKEN1'
"b"                   return 'TOKEN2'
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%start letters

%% /* language grammar */

letters
    :
    | letters letter
    ;

letter
    : 'TOKEN1'
    | 'TOKEN2'
    ;

Parsing the string "aaabbbaaba" with a parser generated by the above grammar definition results in

Parse error on line 1:
aaabbbaaba
^
Expecting 'TOKEN1', 'TOKEN2', got 'INVALID'

Unfortunately I don't know why TOKEN1 isn't found correctly. Having token INVALID removed I get the parse error

Unrecognized text.

I found the description of an association error, resulting in a similar error message, on Issue with a Jison Grammar, Strange error from generate dparser but I couldn't find something similar in my code.

What is a solution for this issue?


Solution

  • Good question.

    jison's lexer generator has two modes: the default mode, and a slightly more flex-compatible mode. You select the latter by placing %options flex after the %lex line.

    In default mode:

    1. The first matching pattern wins, even if a later pattern would match a longer token; and

    2. Patterns which end with a letter or digit have an implicit \b added to them, which limits the match to end on a word-boundary.

    In flex mode, patterns are not altered, and the normal flex first-longest rule applies. However, the generated lexer will be slower (see below).

    So in your lexer definition, "a" will not match the first a in the input string, because the generated lexer is actually trying to match a\b -- that is, an a followed by a word boundary.

    You can workaround the issue by simply surrounding the pattern with parentheses:

    ("a")    { return 'TOKEN1'; }
    

    or by using a character class instead of a quoted letter:

    [a]      { return 'TOKEN1'; }
    

    or by adding %options flex to your %lex section.


    By way of an explanation

    jison, unlike flex, does not build a single DFA lexer. Instead, it converts each lex pattern into an anchored javascript regular expression, and at each request for a token, it tries all the patterns until it finds the correct match.

    To implement the flex first-longest match rule, the jison-generated lexer needs to try every regular expression for every token, because it can't know which is the longest match until it tries them all. The "first-match" rule can be quite a bit faster, particularly if common token patterns are put near the beginning of the file.

    Unfortunately, the first-match rule is a lot harder to work with in the common case where a token might be a keyword or an identifier, and identifiers which happen to start with a keyword need to be matched as identifiers. With first-longest matching, it's sufficient to put the keyword first, since an identifier with a keyword prefix will be longer. But with first-match, it's necessary to limit the keywords or identifiers or both to end on a word-boundary.

    Hence the combination of the two rules described above, which means that the normal pattern of listing the keywords before the Identifier pattern will still work. The word-boundary test prevents the keyword patterns from spurious prefix matches.

    But if you have a lot of keywords, that is still a lot of patterns, even though most of them will fail quickly. So rather than using the flex convention:

    "do"                         { return DO; }
    "end"                        { return END; }
    /* ... */
    [[:alpha:]][[:alnum:]_]*     { return "ID"; }
    

    you're a lot better off making keywords represent themselves (as well as other fixed tokens, like operators), because that lets you combine all the keywords and multi-character operator patterns into a single regular expression:

    /* Keywords and multicharacter operators in a single enormous pattern */
    /* For jison mode, I added a manual \b because it won't be added 
     * automatically. In flex mode, that won't hurt, but it could be
     * removed.
     */
    ("do"|"else"|"end"|"if"|"then"|"while")\b|[<>!=]"=" { return yytext; }
    [[:alpha:]][[:alnum:]_]*                        { return "ID"; }
    [[:digit:]]+("."[[:digit:]]*)?                  { return "NUMBER"; }
    [[:space:]]+                                    ;
    /* All single character tokens use a fallback rule */
    .                                               { return yytext; }
    

    The <<EOF>> rule

    Many jison grammars have an explicit <<EOF>> lexical rule, which returns some token like "EOF" or "END_OF_FILE". This token is recognized by an explicit augmented start production, which has return in its action:

    %lex
    %%
    // ...
    <<EOF>> { return "EOF"; }
    /lex
    
    %start input
    %%
    input: start EOF { return $1; }
    
    start: /* The real start token */
    

    This is a jison-specific idiom, which many would consider poor style in flex/bison. It allows the generated grammar to return an actual value, which is the result of the parse.

    Don't just add the <<EOF>> rule to the lexical rules. If you supply your own EOF token, you are responsible for recognizing it in the parser. If the parser has no rule which matches the EOF token, the parse will fail.