%{
#undef yywrap
#define yywrap() 1
#include<stdio.h>
int statements = 0;
int ids = 0;
int assign = 0;
int rel = 0;
int keywords = 0;
int integers = 0;
%}
DIGIT [0-9]
LETTER [A-Za-z]
TYPE int|char|bool|float|void|for|do|while|if|else|return|void
%option yylineno
%option noyywrap
%%
\n {statements++;}
{TYPE} {/*printf("%s\n",yytext);*/keywords++;}
(<|>|<=|>=|==) {rel++;}
'#'/[a-zA-Z0-9]* {;}
[a-zA-Z]+[a-zA-Z0-9]* {printf("%s\n",yytext);ids++;}
= {assign++;}
[0-9]+ {integers++;}
. {;}
%%
int main(int argc, char **argv)
{
    FILE *fh;
    if (argc == 2 && (fh = fopen(argv[1], "r"))) {
        yyin = fh;
    }
    yylex();
    printf("statements = %d ids = %d assign = %d rel = %d keywords = %d integers = %d \n",statements,ids,assign,rel,keywords,integers);
    return 0;
}
//Input file.c
#include<stdio.h>
void main(){
float a123;
char a;
char b123;
char c;
int ab[5];
int bc[2];
int ca[7];
int ds[4];
for( a = 0; a < 5 ;a++)
printf("%d ", a);
return 0;
}
output:
include
stdio
h
main
a123
a
b123
c
ab
bc
ca
ds
a
a
a
printf
d
a
statements = 14 ids = 18 assign = 1 rel = 3 keywords = 11 integers = 7
I am printing the identifiers on the go. `#include<stdio.h>` is being counted as an identifier. How do I avoid this? I have tried the `'#'/[a-zA-Z0-9]* {;}` rule:action pair but it is still being counted as an identifier. How is the file being tokenized?
Also, the `%d` string in `printf` is being counted as an identifier. I have explicitly written that identifiers should only begin with letters, so why is `%d` being inferred as an identifier?
> I have tried `'#'/[a-zA-Z0-9]* {;}` rule:action pair but it [`include`] is still being counted as identifier. How is the file being tokenized?
Tokens are recognized one at a time. Each token starts where the previous token finished.
`'#'/[a-zA-Z0-9]*` matches the three-character sequence `'#'` (apostrophes included), provided it is followed by `[a-zA-Z0-9]*`. You probably meant `"#"/[a-zA-Z0-9]*` (with double quotes), which would match a `#`, again provided it is followed by zero or more letters or digits. Note that only the `#` is matched; the pattern after the `/` is "trailing context", which is basically a lookahead assertion. In this case, the lookahead is pointless because `[a-zA-Z0-9]*` can match the empty string, so any `#` would be matched. In any event, after the `#` is matched as a token, the scan continues at the next character. So the next token would be `include`.
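To see the one-token-at-a-time behaviour directly, here is a minimal sketch of a scanner that uses the double-quoted `"#"` rule and prints each token it recognizes. The token labels `HASH` and `ID` and the file name `demo.l` are made up for illustration; the catch-all `.|\n` rule silently discards everything else.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
"#"                     { printf("HASH\n"); }          /* a single # character */
[a-zA-Z][a-zA-Z0-9]*    { printf("ID(%s)\n", yytext); } /* letters, then letters or digits */
.|\n                    { /* catch-all: ignore anything else */ }
%%
int main(void)
{
    yylex();
    return 0;
}
```

Built with `flex demo.l && cc lex.yy.c -o demo` and fed the single line `#include<stdio.h>`, this should print `HASH`, `ID(include)`, `ID(stdio)` and `ID(h)`, one per line: the `#` is consumed on its own, and scanning simply resumes at `include`.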
Because of the typo, that pattern does not match. (There are no apostrophes in the source.) So what actually matches is your "fallback" rule: the rule whose pattern is `.`. (We call this a fallback rule because it matches anything. Really, it should be `.|\n`, since `.` matches anything but a newline, but as long as you have some rule which matches a newline character, it's acceptable to use `.`. If you don't supply a fallback rule, one will be inserted automatically by flex with the action `ECHO`.)
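The automatically inserted rule is easy to observe. The following sketch deliberately has no fallback rule, so any character that the single identifier rule does not match is handled by flex's default rule and echoed to the output. The bracketed output format is only there to make the matched tokens visible.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
[a-zA-Z][a-zA-Z0-9]*    { printf("[%s]", yytext); }  /* the only explicit rule */
%%
int main(void)
{
    /* No fallback rule above: unmatched characters go through the default ECHO. */
    yylex();
    return 0;
}
```

Given the input line `a+b`, this should print `[a]+[b]`: the `+` and the newline are copied through unchanged by the default rule. Adding an explicit `.|\n { ; }` rule would make them disappear instead.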
Thus, the `#` is ignored (just as it would have been if you'd written the rule as intended) and again the scan continues with the token `include`.
If you wanted to ignore the entire preprocessor directive, you could do something like
^[[:blank:]]*#.* { ; }
(from a comment)
> I am getting `stdio` and `h` as keywords, how does that fit the definition that I have given? What happened to the `.` in between?
After the `<` is ignored by the fallback rule, `stdio` is matched. Since `[a-zA-Z]+[a-zA-Z0-9]*` doesn't match anything other than letters and digits, the `.` is not considered part of the token. Then the `.` is matched and ignored by the fallback rule, and then `h` is matched.
> Also the `%d` string in `printf` is being counted as an identifier.
Not really. The `%` is explicitly ignored by the fallback rule (as was the `"`) and then the `d` is matched as an identifier. If you want to ignore words in string literals, you will have to recognise and ignore string literals.
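As a rough sketch of what that could look like, the scanner below skips whole preprocessor directives and double-quoted string literals before the identifier rule gets a chance to see their contents. It is a trimmed-down variant of the scanner in the question, not a drop-in replacement: the `KEYWORD` name and the string-literal pattern are my own choices, and only the `ids` and `keywords` counters are kept.

```lex
%{
#include <stdio.h>
int ids = 0;
int keywords = 0;
%}
%option noyywrap
KEYWORD int|char|bool|float|void|for|do|while|if|else|return
%%
^[[:blank:]]*#.*          { /* skip a whole preprocessor directive */ }
\"([^"\\\n]|\\.)*\"       { /* skip a string literal, backslash escapes included */ }
{KEYWORD}                 { keywords++; }
[a-zA-Z][a-zA-Z0-9]*      { printf("%s\n", yytext); ids++; }
.|\n                      { /* fallback: ignore everything else */ }
%%
int main(void)
{
    yylex();   /* reads standard input */
    printf("ids = %d keywords = %d\n", ids, keywords);
    return 0;
}
```

Because flex always prefers the longest match, the directive rule consumes all of `#include<stdio.h>` and the string rule consumes all of `"%d "`, so `include`, `stdio`, `h` and `d` should no longer be printed as identifiers. The string pattern assumes literals do not span lines, and it makes no attempt to handle comments or character constants such as `'"'`.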