I am trying to match single line comment pattern in flex. Patterns of the comment could be:
//this is a single /(some random stuff) line comment
Or it could be like this:
// this is also a comment\
continuation of the comment from previous line
From the example it's obvious that I have to handle the multi-line case too.
Now my approach was using states. This is what I have so far:
"//" {
yymore();
BEGIN (SINGLE_COMMENT);
}
<SINGLE_COMMENT>([^{NEWLINE}]|\\[(.){NEWLINE}]) {
yymore();
}
<SINGLE_COMMENT>([^{NEWLINE}]|[^\\]{NEWLINE}) {
logout << "Line no " << line_count << ": TOKEN <COMMENT> Lexeme " << string(yytext) << "\nfound\n\n";
BEGIN (INITIAL);
}
NEWLINE is declared as:
NEWLINE \r?\n
My declaration unit:
%option noyywrap
%x SINGLE_COMMENT
int line_count = 1;
const int bucketSize = 10; // change if necessary
ofstream logout;
ofstream tokenout;
SymbolTable symbolTable(bucketSize);
Action of NEWLINE:
{NEWLINE} {
line_count++;
}
If I run it with the following input:
// hello\
int main
This is my log file:
Line no 1: TOKEN <COMMENT> Lexeme // hello\
found
Line no 1: TOKEN <INT> Lexeme int found
Line no 1: TOKEN <ID> Lexeme main found
ScopeTable # 1
6 --> < main , ID >
So, it's not catching the multi-line comment. Also the line_count is not incremented. It's staying the same. Can anybody help me figuring out what I have done wrong?
In (f)lex, as in most regular expression engines, [
and ]
enclose a character class description. A character class is a set of individual characters, and it always matches exactly one character which is a member of that set. There are also negated character classes which are written the same way except that they start with [^
and match exactly one character which is not a member of the set.
Character classes are not the same as sequences of characters:
ab
matches an a
followed by a b
[ab]
matches either an a
or a b
Since character classes are just sets of characters, it is meaningless for the individual characters in the class to be repeated or optional, etc. Consequently, almost no regular expression operators (*
, +
, ?
, etc.) are meaningful inside a character class. If you put one of them in a character class expression, it is handled just like an ordinary character:
a*
matches 0 or more a
s[a*]
matches either an a
or a *
One of the features flex provides which is not provided by most other regular expression systems is macro expansions, of the form {name}
. Here the {
and }
indicate the expansion of a defined macro, whose name is contained between the braces. These characters are also not special inside a character class:
{identifier}
matches whatever the expanded macro named identifier
would match.[{identifier}]
matches a single character which is {
, }
or one of the letters definrt
Macro definitions seem to be overused by beginners. My advice is always to avoid them, and thereby avoid the confusion which they create.
It's also worth noting that (f)lex does not have an operator which negates a subpattern. Only character classes can be negated; there is no easy way to write "match anything other than foo
". However, you can generally rely on the first longest-match rule to effectively implement negations: if some pattern p
executes, then there cannot be any pattern which would match more than p
. Thus, it might not be necessary to explicitly write the negation.
For example, in your comment detector where the only real issue is dealing with carriage return (\r
) characters which are not followed by newline characters, you could use (f)lex's pattern matching algorithm to your advantage:
<SINGLE_COMMENT>{
[^\\\r\n]+ ;
\\\r?\n { ++line_count; }
\\. ; /* only matches if the above rule doesn't */
\r?\n { ++line_count; BEGIN(INITIAL); }
\r ; /* only matches if the above rule doesn't */
}
By the way, it's usually much easier to provide %option yylineno
than to try to track newlines manually.