Search code examples
bisonjison

Making a parser to ignore a line comment, except a trailing comment


I am using Jison (Javascript version of Bison, very similar).

Objective

I want to parse an input and get valid tokens (IDENTIFIER and trailing comments)

My definition

  • IDENTIFIER
    • A word contains alphabets
  • A line comment
    1. A whole single line which starts with -- and zero or more letters; or
    2. A whole single line which starts with one or more whitespaces (spaces, tabs) followed by -- and zero or more letters
  • A trailing comment
    • A string which starts with -- but it is following a IDENTIFIER For simplicity, let's assume I have an input file to parse:

Input

I want -- not to ignore this
-- This must be ignored

To parse the above, I write the Jison/Bison file (see below) but it is incomplete and only parses non-comment content, i.e.:

Output (current, wrong)

I
want

But, I want a parser to take a trailing comment as a single valid string, i.e.:

Output (expected)

I
want
-- not to ignore this

Given the Jison/Bisnon file as below, how can I modify it as I intended?

Jison/Bison

%lex
%%

\s+         /* skip whitespace */
'--'.*      /* skip comment */
[a-zA-Z]+   return 'IDENTIFIER'
<<EOF>>     return 'EOF'
.           /* skip the others */

/lex

%start expressions
%%

expressions
    : expressions EOF
    | expressions expression
    | expression
        {;}
    ;

expression
    : 'IDENTIFIER'
        {console.log($1); $$ = $1;}
    ;

Solution

  • In (f)lex, I'd just use a rule starting with ^ to force it to only match at the beginning of a line, but jison doesn't interpret carets that way, and in any case you don't really want to limit initial comments to the precise start of the line, but rather to the first non-whitespace token on the line. (It's not clear to me how you want to handle tokens other than identifiers. I've ignored that issue. See the comment start with three question marks.)

    One simple way to do that is with "start conditions" (see example in Jison's documentation), like this: (But please see the note below for a better solution):

    %lex
    %s TRAILING
    %%
    
    \n                  this.begin('INITIAL')
    \s                  /* skip whitespace */
    <TRAILING>'--'.*    return 'IDENTIFIER'; /* or whatever */
    <INITIAL>'--'.*     /* skip initial comments */
    [a-zA-Z]+           this.begin('TRAILING'); return 'IDENTIFIER'
    .                   /* ??? this.begin('TRAILING'); */ /* skip the others */
    <<EOF>>             return 'EOF'
    

    Note: (Added after I remembered this Jison start condition gotcha.) The above code is highly influenced by (f)lex's start condition interface. Both jison and flex (but not lex) allow for a stack of start conditions, which is convenient when contexts nest. As the above example shows, contexts do not always nest; sometimes it is more convenient to just switch from one to another as in a state machine. The BEGIN (f)lex macro does just that; flex also provides yy_push_state and yy_pop_state to use the start condition stack. (In flex, you must specify %option stack for this to work.)

    Jison, for some reason, only provides stack-oriented transitions, and this.begin is just an alias for this.pushStack. Consequently, unlike the flex interface, you really should balance each this.begin with a this.popStack (and maybe change this.begin to this.pushStack in order to be less confusing). The above code, as written, does lots of useless start condition stack pushes and no pops, so it will unnecessarily use a lot of memory particularly on large input.

    The above example could be rewritten to use the stack more frugally, but it would require either duplicating most of the pattern rules (one version for condition INITIAL which pushes when an IDENTIFIER is found but does not pop on newlines, and another version for TRAILING which does not push in IDENTIFIER but does pop in the newline rule). That seems a bit ugly to me, so I provide the following alternative implementation which uses a custom field in the lexer object to store the line number of the previously encountered IDENTIFIER, and only adds comments to the output if they are on the same line:

    %lex
    %%
    
    \s+             /* skip whitespace */
    '--'.*          if (yylloc.first_line == this.last_id_line) return 'IDENTIFIER';
    [a-zA-Z]+       this.last_id_line = yylloc.first_line; return 'IDENTIFIER'
    .               /* skip the others */
    <<EOF>>         return 'EOF'
    

    Adding custom fields to the lexer object must be done with care; there is no documentation defining which names are reserved (or which name prefixes could be used).