
Flex and Bison - Grammar that sometimes cares about spaces


Currently I'm trying to implement a grammar which is very similar to Ruby's. To keep things simple, the lexer currently ignores space characters.

However, in some cases whitespace makes a big difference:

def some_callback(arg=0)
    arg * 100
end

some_callback (1 + 1) + 1  # 300
some_callback(1 + 1) + 1   # 201
some_callback +1           # 100
some_callback+1            # 1
some_callback + 1          # 1

So currently all whitespaces are being ignored by the lexer:

{WHITESPACE} { ; }

And the language says for example something like:

UnaryExpression:
    PostfixExpression
  | T_PLUS UnaryExpression
  | T_MINUS UnaryExpression
  ;

One way I can think of to solve this problem would be to explicitly add whitespace throughout the grammar, but doing so would increase its complexity a lot:

// OLD:
AdditiveExpression:
    MultiplicativeExpression
  | AdditiveExpression T_ADD MultiplicativeExpression
  | AdditiveExpression T_SUB MultiplicativeExpression
  ;

// NEW:
_:
    /* empty */
  | WHITESPACE _;

AdditiveExpression:
    MultiplicativeExpression
  | AdditiveExpression _ T_ADD _ MultiplicativeExpression
  | AdditiveExpression _ T_SUB _ MultiplicativeExpression
  ;

//...

UnaryExpression:
    PostfixExpression
  | T_PLUS UnaryExpression
  | T_MINUS UnaryExpression
  ;

So I'd like to ask whether there is any best practice for solving this kind of grammar.

Thank you in advance!


Solution

  • Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.

    Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.

    You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token).

    If you're doing this with (f)lex, you can use the / trailing-context operator:

    [[:alpha:]_][[:alnum:]_]*/"("   { yylval.id = strdup(yytext); return FUNC_CALL; }
    [[:alpha:]_][[:alnum:]_]*       { yylval.id = strdup(yytext); return IDENT; }
    

    The grammar is now pretty straightforward:

    call: FUNC_CALL '(' expression_list ')'   /* foo(1, 2) */
        | IDENT expression_list               /* foo (1, 2) */
        | IDENT                               /* foo * 3 */
        ;
    
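    The one-character lookahead that drives this grammar can be illustrated with a small, self-contained C function (a sketch only; classify_ident is an invented helper, not part of flex, and the real scanner would of course work on a character stream rather than a string):

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds matching the grammar sketch above. */
enum ident_tok { FUNC_CALL, IDENT };

/* Classify an identifier at the start of s: FUNC_CALL if a '(' follows
 * it immediately (no whitespace), IDENT otherwise.  This mirrors the
 * one-character lookahead performed by the flex trailing-context rule. */
enum ident_tok classify_ident(const char *s)
{
    size_t n = 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return s[n] == '(' ? FUNC_CALL : IDENT;
}
```

    With this classification, `foo(1, 2)` starts with a FUNC_CALL token while `foo (1, 2)` and `foo * 3` start with a plain IDENT, which is exactly the distinction the grammar above relies on.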

    This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:

    name: IDENT | FUNC_CALL ;
    

    But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:

    func_defn: "def" name '(' parameters ')' block "end" ;
    

    (I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)

    More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
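    That contextual decision can be sketched as a plain C predicate (an illustration only; starts_signed_literal and is_local are hypothetical names, and a real scanner would consult the symbol table maintained by the parser):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical symbol-table lookup; a real scanner would consult the
 * table fed back from the parser.  Here we pretend "x" is a local. */
static bool is_local(const char *name)
{
    return strcmp(name, "x") == 0;
}

/* Given an identifier and the text immediately following it, decide
 * whether a sign there should start a signed literal: the identifier
 * must not be a known local, at least one blank must precede the sign,
 * and a digit must follow the sign with no intervening space. */
bool starts_signed_literal(const char *ident, const char *rest)
{
    if (is_local(ident))
        return false;
    if (*rest != ' ' && *rest != '\t')
        return false;                      /* sign must be preceded by a blank */
    while (*rest == ' ' || *rest == '\t')
        rest++;
    if (*rest != '+' && *rest != '-')
        return false;
    return isdigit((unsigned char)rest[1]) != 0;
}
```

    Under this rule, `some_callback +1` becomes a call with the argument 1, while `some_callback+1` and `some_callback + 1` remain binary additions, matching the behaviour shown in the question.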

    That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not-fully-fleshed-out implementation, building on the previous one:

    %x SIGNED_NUMBERS
    %%
    
    [[:alpha:]_][[:alnum:]_]*/"("          { yylval.id = strdup(yytext);
                                             return FUNC_CALL; }
    [[:alpha:]_][[:alnum:]_]*/[[:blank:]]  { yylval.id = strdup(yytext);
                                             if (!is_local(yylval.id))
                                                 BEGIN(SIGNED_NUMBERS);
                                             return IDENT;  }
    [[:alpha:]_][[:alnum:]_]*              { yylval.id = strdup(yytext);
                                             return IDENT;  }
    <SIGNED_NUMBERS>[[:blank:]]+           ;
     /* Numeric patterns, one version for each context */
    <SIGNED_NUMBERS>[+-]?[[:digit:]]+      { yylval.integer = strtol(yytext, NULL, 0);
                                             BEGIN(INITIAL);
                                             return INTEGER; }
    [[:digit:]]+                           { yylval.integer = strtol(yytext, NULL, 0);
                                             return INTEGER; }
    
     /* ... */
     /* If the next character is not a digit or a sign, rescan in INITIAL state */
    <SIGNED_NUMBERS>.|\n                   { yyless(0); BEGIN(INITIAL); }
    

    Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this still depends on being able to distinguish between local variables and other identifiers, which requires lexical feedback through the symbol table.
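    The lexer's side of that alternative needs only the adjacent characters; it emits a distinct token kind and defers the real decision to the parser. A minimal sketch (T_SPACED_SIGN and classify_plus are invented names for illustration):

```c
#include <ctype.h>

/* Invented token kinds for illustration. */
enum sign_tok { T_PLUS, T_SPACED_SIGN };

/* A '+' that follows a blank and is directly followed by a digit is
 * reported as a distinct token; the parser, which has access to the
 * symbol table, then decides whether it starts a signed literal or
 * acts as a binary operator. */
enum sign_tok classify_plus(char before, char after)
{
    if ((before == ' ' || before == '\t') && isdigit((unsigned char)after))
        return T_SPACED_SIGN;
    return T_PLUS;
}
```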

    It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.