
Dynamically turning on and off flex tokens


I have a program that should lex its input depending on a command-line argument. So, the requirement is that 1/2 is lexed as:

NUMBER
SLASH
NUMBER

...when given one command-line argument, and lexed as:

FREEFORM_TOKEN

...when given another command-line argument. The tool I'm using is flex.

I'm wondering whether flex can support this use case. The rules I have are:

[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)? {
  yylval->d = atof(yytext);
  return NUMBER;
}

[A-Za-z0-9_.]([A-Za-z0-9_./]*[A-Za-z0-9_.])? {
  yylval->s = strdup(yytext);
  return FREEFORM_TOKEN;
}

Can I simply turn the token on and off dynamically with an if statement, like this:

[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)? {
  yylval->d = atof(yytext);
  return NUMBER;
}

[A-Za-z0-9_.]([A-Za-z0-9_./]*[A-Za-z0-9_.])? {
  if (cmd_line_argument_given)
  {
    yylval->s = strdup(yytext);
    return FREEFORM_TOKEN;
  }
}

...or is there some issue in having the long regexp in the .l file, which would cause 1/2 to match but not return anything?

How in practice should I implement this requirement?

Should I do this instead:

<INITIAL,CMDOPT>[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)? {
  yylval->d = atof(yytext);
  return NUMBER;
}

<CMDOPT>[A-Za-z0-9_.]([A-Za-z0-9_./]*[A-Za-z0-9_.])? {
  yylval->s = strdup(yytext);
  return FREEFORM_TOKEN;
}

...and then do BEGIN(CMDOPT) if I want to turn on FREEFORM_TOKEN, and just leave the scanner in the INITIAL state if I want to turn it off? Then every rule would be tagged with both the INITIAL and CMDOPT states, except FREEFORM_TOKEN, which would be tagged with CMDOPT only.


Solution

  • I'm wondering whether flex can support this use case.

    Yes, in various ways.

    Can I simply turn the token on and off dynamically with an if statement, like this:

    No, that is not (quite) one of the ways. If you simply make it conditional whether any action is taken, then when no action is taken, the matched text will be silently consumed and discarded, instead of being matched against a different rule.

    To instead make flex fall back to a different rule in that case, you would use the REJECT directive (written as a bare REJECT; statement in the action, without parentheses). This instructs flex to instead apply the next-best rule matching the input (or a prefix of it).

    Note well that REJECT appearing at all in your scanner definition makes the whole scanner substantially slower. This is the single worst thing you can do for scanner performance. But that might not be a problem for you in practice.
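
    As a minimal sketch of the REJECT approach, assuming a plain non-reentrant scanner: the flag freeform_enabled, the --freeform option, and the token numbering are hypothetical names introduced for illustration only, and the yylval assignments from the question are left out to keep the example self-contained.

    ```lex
    %{
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical token codes; a real project would take these from the parser. */
    enum { NUMBER = 258, SLASH, FREEFORM_TOKEN };

    /* Hypothetical flag, set from main() before scanning starts. */
    static int freeform_enabled = 0;
    %}

    %option noyywrap

    %%
    [0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?   { return NUMBER; }

    [A-Za-z0-9_.]([A-Za-z0-9_./]*[A-Za-z0-9_.])?   {
        if (!freeform_enabled)
            REJECT;   /* hand the input to the next-best matching rule */
        return FREEFORM_TOKEN;
    }

    "/"        { return SLASH; }
    [ \t\n]+   { /* skip whitespace */ }
    %%

    int main(int argc, char **argv)
    {
        int tok;

        /* "--freeform" is an assumed option name, not part of the question. */
        freeform_enabled = (argc > 1 && strcmp(argv[1], "--freeform") == 0);

        while ((tok = yylex()) != 0)
            printf("%d '%s'\n", tok, yytext);
        return 0;
    }
    ```

    With the flag off, the free-form rule still wins the longest match on 1/2, but REJECT makes flex back off to the shorter prefix 1, which the NUMBER rule accepts; the remaining / and 2 are then scanned as SLASH and NUMBER.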

    How in practice should I implement this requirement?

    Should I [use start conditions] instead [?]

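    Start conditions are usually the better option for selecting among different rule subsets. As @Cheatah observed first, you can set the start condition appropriately before scanning begins, either with BEGIN or with yy_push_state() (the latter requires %option stack). Both are defined only inside the generated scanner, so call them from a small helper function placed in the user-code section of the .l file. This would be my recommendation.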

    If you do use start conditions, then you can make your rules simpler by using two of them, one for each syntax option, and making them inclusive ones. Then all the rules that you do not mark with any start condition will apply to both, and you need mark only those that are specific to one start condition or the other.
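
    A minimal sketch of the start-condition approach, again assuming a plain non-reentrant scanner. As a simplification it uses INITIAL for the default syntax plus a single inclusive condition rather than two named ones; the names FREEFORM, enable_freeform(), and --freeform, as well as the token numbering, are hypothetical.

    ```lex
    %{
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical token codes; a real project would take these from the parser. */
    enum { NUMBER = 258, SLASH, FREEFORM_TOKEN };
    %}

    %option noyywrap

    %s FREEFORM

    %%
    [0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?   { return NUMBER; }

    <FREEFORM>[A-Za-z0-9_.]([A-Za-z0-9_./]*[A-Za-z0-9_.])?   { return FREEFORM_TOKEN; }

    "/"        { return SLASH; }
    [ \t\n]+   { /* skip whitespace */ }
    %%

    /* BEGIN is a macro local to the generated scanner, so expose it
       to the rest of the program through a small wrapper. */
    void enable_freeform(void)
    {
        BEGIN(FREEFORM);
    }

    int main(int argc, char **argv)
    {
        int tok;

        /* "--freeform" is an assumed option name, not part of the question. */
        if (argc > 1 && strcmp(argv[1], "--freeform") == 0)
            enable_freeform();

        while ((tok = yylex()) != 0)
            printf("%d '%s'\n", tok, yytext);
        return 0;
    }
    ```

    Because FREEFORM is declared with %s (inclusive), the unmarked rules stay active in both modes, so only the free-form rule needs a start-condition prefix. In FREEFORM mode the longest-match rule makes 1/2 a single FREEFORM_TOKEN; in the INITIAL state the free-form rule is inactive and the same input scans as NUMBER, SLASH, NUMBER.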