Search code examples
regexflex-lexer

Regular expression error in Flex


I'm learning to write a lexer with Flex and I found a weird problem. I tried to define the regular expression of keyword class.

For my test case:

class T{
}

Expression 1:

ws [\t\r\f\v ]
CLASS (^class$)|(^class{ws})|({ws}class{ws})|({ws}class$)

doesn't work.

Expression 2:

ws [\t\r\f\v ]
CLASS ^class{ws}

works.

Expression 3:

ws [\t\r\f\v ]
CLASS1 ^class{ws}
CLASS {CLASS1}

reports error saying unrecognized rule.

I feel so confused and is there any difference between 2 and 3? I refer to examples on Flex github. It just uses similar expressions to define number.

Any help will be appreciated!

Update:

My script:

%{

/* Some include headers here */

/* The compiler assumes these identifiers. */
#define yylval cool_yylval
#define yylex  cool_yylex

/* Max size of string constants */
#define MAX_STR_CONST 1025
#define YY_NO_UNPUT   /* keep g++ happy */

extern FILE *fin; /* we read from this file */

/* define YY_INPUT so we read from the FILE fin:
 * This change makes it possible to use this scanner in
 * the Cool compiler.
 */
#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) \
        if ( (result = fread( (char*)buf, sizeof(char), max_size, fin)) < 0) \
                YY_FATAL_ERROR( "read() in flex scanner failed");

char string_buf[MAX_STR_CONST]; /* to assemble string constants */
char *string_buf_ptr;
extern int curr_lineno;
extern int verbose_flag;
extern YYSTYPE cool_yylval;

%}

/*
 * Define names for regular expressions here.
 */
ws              [\t\r\f\v ]
CLASS1          {ws}class$
CLASS2          ^class$
CLASS3          ^class{ws}
CLASS4          {ws}class{ws}
CLASS           {CLASS2}
DARROW          =>
STRING          \"[^\n"]+\"

%%
{CLASS} {return (CLASS);}  /*returned CLASS is defined in other header files*/
%%

Solution

  • What you're doing is unnecessary: you can just use class as your regex without any anchors or mentions of white space. You're probably defining it like that because you want to avoid the regex matching part of a larger identifier (e.g. you don't want classOf99 to be interpreted as the keyword class, followed by the identifier Of99), but that's not going to happen: When there are multiple regexen that could match on the current input, Flex will always take the one that leads to the longest match (and in case of ties, it will pick the one that comes first in the flex file). This is known as the maximal munch rule and avoids exactly this issue.

    So your code should just be class { return CLASS; } and none of the definitions are necessary.


    That said, here's why your attempts did not work:

    You can not use ^ and $. Some times this will lead to "unrecognized rules", other times it will just fail to match.

    So that explains why your first definition does not work. But why does the second not work? If you use definitions inside of other definitions, they're implicitly wrapped in parentheses to avoid precedence issues. So you can't use definitions using ^ or $ inside other definitions either.

    So the only way to have multiple alternatives involving ^ and $ would be to have separate rules like this:

    ^re  { return X; }
    re$  { return X; }
    

    But again, this really isn't necessary in your case (or ever).