Search code examples
regexflex-lexer

Comment pattern match in flex using states


I am trying to match single line comment pattern in flex. Patterns of the comment could be:

//this is a single /(some random stuff) line comment

Or it could be like this:

// this is also a comment\
continuation of the comment from previous line

From the example it's obvious that I have to handle the multi-line case too.

Now my approach was using states. This is what I have so far:

"//"                    {
                            yymore();
                            BEGIN (SINGLE_COMMENT); 
                        }

<SINGLE_COMMENT>([^{NEWLINE}]|\\[(.){NEWLINE}]) {
                                                    yymore();
                                                }           

<SINGLE_COMMENT>([^{NEWLINE}]|[^\\]{NEWLINE})   {
                                                    logout << "Line no " << line_count << ": TOKEN <COMMENT> Lexeme " << string(yytext) << "\nfound\n\n";
                                                    BEGIN (INITIAL);
                                                }

NEWLINE is declared as:

 NEWLINE \r?\n

My declaration unit:

%option noyywrap

%x SINGLE_COMMENT

int line_count = 1;
const int bucketSize = 10; // change if necessary

ofstream logout;
ofstream tokenout;

SymbolTable symbolTable(bucketSize);

Action of NEWLINE:

{NEWLINE}    {
                line_count++;
             }

If I run it with the following input:

// hello\
int main

This is my log file:

Line no 1: TOKEN <COMMENT> Lexeme // hello\

found

Line no 1: TOKEN <INT> Lexeme int found

Line no 1: TOKEN <ID> Lexeme main found


 ScopeTable # 1

 6 --> < main , ID > 

So, it's not catching the multi-line comment. Also the line_count is not incremented. It's staying the same. Can anybody help me figuring out what I have done wrong?

Link to code


Solution

  • In (f)lex, as in most regular expression engines, [ and ] enclose a character class description. A character class is a set of individual characters, and it always matches exactly one character which is a member of that set. There are also negated character classes which are written the same way except that they start with [^ and match exactly one character which is not a member of the set.

    Character classes are not the same as sequences of characters:

    • ab matches an a followed by a b
    • [ab] matches either an a or a b

    Since character classes are just sets of characters, it is meaningless for the individual characters in the class to be repeated or optional, etc. Consequently, almost no regular expression operators (*, +, ?, etc.) are meaningful inside a character class. If you put one of them in a character class expression, it is handled just like an ordinary character:

    • a* matches 0 or more as
    • [a*] matches either an a or a *

    One of the features flex provides which is not provided by most other regular expression systems is macro expansions, of the form {name}. Here the { and } indicate the expansion of a defined macro, whose name is contained between the braces. These characters are also not special inside a character class:

    • {identifier} matches whatever the expanded macro named identifier would match.
    • [{identifier}] matches a single character which is {, } or one of the letters definrt

    Macro definitions seem to be overused by beginners. My advice is always to avoid them, and thereby avoid the confusion which they create.

    It's also worth noting that (f)lex does not have an operator which negates a subpattern. Only character classes can be negated; there is no easy way to write "match anything other than foo". However, you can generally rely on the first longest-match rule to effectively implement negations: if some pattern p executes, then there cannot be any pattern which would match more than p. Thus, it might not be necessary to explicitly write the negation.

    For example, in your comment detector where the only real issue is dealing with carriage return (\r) characters which are not followed by newline characters, you could use (f)lex's pattern matching algorithm to your advantage:

    <SINGLE_COMMENT>{
      [^\\\r\n]+          ;
      \\\r?\n             { ++line_count; }
      \\.                 ; /* only matches if the above rule doesn't */
      \r?\n               { ++line_count; BEGIN(INITIAL); }
      \r                  ; /* only matches if the above rule doesn't */
    }
    

    By the way, it's usually much easier to provide %option yylineno than to try to track newlines manually.