Handling new lines in Flex/Bison

I am trying to make a C-like language using Flex/Bison. My problem is that I can't find a proper way to handle new lines. I have to ignore all new lines so I don't returnt them as a token to Bison because that would make the grammar rules so difficult to make but I am asked in some rules to make a mandatory change of line. For example:

Program "identifier" -> mandatory change of line

Function "identifier"("parameters") -> mandatory change of line

If I return \n as a token to flex then i have to put new lines in all of my grammar rules and that's surely not practical. I tried to make a variable work like a switch or something but it didn't quite work. Any help or suggestion?

Solution

If the required newline is simply aesthetic -- that is, if it isn't required in order to avoid an ambiguity -- then the easiest way to enforce it is often just to track token locations (which is something that bison and flex can help you with) so that you can check in your reduction action that two consecutive tokens were not on the same line:

func_defn: "function" IDENT '(' opt_arg_list ')' body "end" {
    if (@5.last_line == @6.first_line) {
      yyerror("Body of function must start on a new line");
      /* YYABORT; */ /* If you want to kill the parse at this point. */
    }
    // ...
}

Bison doesn't require any declarations or options in order to use locations; it will insert location support if it notices that you use @N in any action (which is how you refer to the location of a token). However, it is sometimes useful to insert a %locations declaration to force location support. Normally no other change is necessary to your grammar.

You do have to insert a little bit of code in your lexer in order to report the location values to the parser. Locations are communicated through a global variable called yylloc, whose value is of type YYLTYPE. By default, YYLTYPE is a struct with four int members: first_line, first_column, last_line, last_column. (See the Bison manual for more details.) These fields need to be set in your lexer for every token. Fortunately, flex allows you to define the macro YY_USER_ACTION, which contains code executed just before every action (even empty actions), which you can use to populate yylloc. Here's one which will work for many simple lexical analysers; you can put it in the code block at the top of your flex file.

/* Simple YY_USER_ACTION. Will not work if any action includes
 * yyless(), yymore(), input() or REJECT.
 */
#define YY_USER_ACTION                                             \
  yylloc.first_line = yylloc.last_line;                            \
  yylloc.first_column = yylloc.last_column;                        \
  if (yylloc.last_line == yylineno)                                \
    yylloc.last_column += yyleng;                                  \
  else {                                                           \
    yylloc.last_line = yylineno;                                   \
    yylloc.last_column = yytext + yyleng - strrchr(yytext, '\n');  \
  }

If the simple location check described above isn't sufficient for your use case, then you can do it through what's called "lexical feedback": a mechanism where the parser not only collects information from the lexical scanner, but also communicates back to the lexer when some kind of lexical change is needed.

Lexical feedback is usually discouraged because it can be fragile. It's always important to remember that the parser and the scanner are not necessarily synchronised. The parser often (but not always) needs to know the next token after the current production, so the lexical state when a production's action is being executed might be the lexical state after the next token, rather than the state after the last token in the production. But it might not; many parser generators, including Bison, try to execute an action immediately if they can figure out that the same action will be executed regardless of the next token. Unfortunately, that's not always predictable. In the case of Bison, for example, changing the parsing algorithm from the default LALR(1) to Canonical LR(1) or to GLR can also change a particular reduction action from immediate to deferred.

So if you're going to try to communicate with the scanner, you should try to do so in a way that will work whether or not the scanner has already been asked for the lookahead token. One way to do this is to put the code which communicates with the scanner in a Mid-Rule Action one token earlier than the token which you want to influence. [Note 1]

In order to make newlines "mostly optional", we need to tell the lexer when it should return a newline instead of ignoring it. One way to do this is to export a function which the lexer can call. We put the definition of that function into the generated parser and its declaration into the generated header file:

/* Anything in code requires and code provides sections is also
 * copied into the generated header. So we can use it to declare
 * exported functions.
 */
%code requires {
  #include <stdbool.h>
  bool need_nl(void);
}
%%
// ...

/* See [Note 2], below. */

/* Program directive. */
prog_decl: "program" { need_nl_flag = true; } IDENT '\n'

/* Function definition */
func_defn: "function" IDENT
           '(' opt_arg_list { need_nl_flag = true; } ')' '\n'
              body
           "end"

// ...
%%
static bool need_nl_flag = false;

/* The scanner should call this function when it sees a newline.
 * If the function returns true, the newline should be returned as a token.
 * The function resets the value of the flag, so it must not be called for any
 * other purpose. (This interface allows us to set the flag in a parser action
 * without having to worry about clearing it later.)
 */
bool need_nl(void) {
  bool temp = need_nl_flag;
  need_nl_flag = false;
  return temp;
}

// ...

Then we just need a small adjustment to the scanner in order to call that function. This uses Flex's set difference operator {-} to make a character class containing all whitespace other than a newline. Because we put that rule first, the second rule will only be used for whitespace including at least one newline character. Note that we only return one newline token for any sequence of blank lines.

([[:space:]]{-}[\n])+  { /* ignore whitespace */ }
[[:space:]]+           { if (need_nl()) return '\n'; }

Notes

That's not something you can do without thought, either: it might also be an error to change the scanner configuration too soon. In the action, you can check whether or not the lookahead token has already been read by looking at the value of yychar. If yychar is YYEMPTY, then no lookahead token has been read. If it is YYEOF, then an attempt was made to read a lookahead token but the end of input was encountered. Otherwise, the lookahead token has already been read.

It might seem tempting to use two actions, one before the token prior to the one you want to affect, and one just before that token. The first action could execute only if yychar is not YYEMPTY, indicating that the lookahead token has already been read and the scanner is about to read the token you want to change, while the second action will only execute if yychar at that point is YYEMPTY. But it's entirely possible that for a particular parse both of those conditions are true, or that neither is true.

Bison does have one configuration which you can use to make the lookahead decision completely predictable. If you set %define lr.default-reduction accepting, then Bison will always attempt to read a lookahead symbol, and you can be sure that placing the action one token early will work. Unless you are using the parser interactively, there is no real cost for enabling this option. But it won't work with old Bison versions or with other parser generators such as byacc.
For this grammar, we could have put the mid-rule actions just before the '\n' tokens rather than one token earlier (as long as the parser is never converted to a GLR or Canonical-LR parser). That's because in both rules, the MRA will go in between two tokens, and (presumably) there are no other rules which might apply up to the first of these tokens. Under those circumstances Bison can certainly know that the MRA can be reduced without examining the lookahead token to see if it is \n: either the next token is a newline and the reduction was required, or the next token is not a newline, which will be a syntax error. Since Bison does not guarantee to detect syntax errors before reduction actions are run, it can reduce the MRA action before knowing whether the parse will succeed.