c compiler-construction tokenize flex-lexer

Lex/flex program to count ids, statements, keywords, operators etc

%{
#undef yywrap
#define yywrap() 1
#include<stdio.h>
  int statements = 0;
  int ids = 0;
  int assign = 0;
  int rel = 0;
  int keywords = 0;
  int integers = 0; 
%}
DIGIT [0-9]
LETTER [A-Za-z]
TYPE int|char|bool|float|void|for|do|while|if|else|return|void
%option yylineno
%option noyywrap

%%
\n {statements++;}
{TYPE} {/*printf("%s\n",yytext);*/keywords++;}
(<|>|<=|>=|==) {rel++;}
'#'/[a-zA-Z0-9]*    {;}
[a-zA-Z]+[a-zA-Z0-9]* {printf("%s\n",yytext);ids++;}
= {assign++;}
[0-9]+ {integers++;}
.      {;}

%%
int main(int argc, char **argv)
{
  FILE *fh;
  /* Scan the file named on the command line; otherwise yyin stays at stdin. */
  if (argc == 2 && (fh = fopen(argv[1], "r"))) {
    yyin = fh;
  }
  yylex();
  printf("statements = %d ids = %d assign = %d rel = %d keywords = %d integers = %d \n",statements,ids,assign,rel,keywords,integers);
  return 0;
}

//Input file.c

#include<stdio.h>
void main(){
    float a123;
    char a;
    char b123;
    char c;
    int ab[5];
    int bc[2];
    int ca[7];
    int ds[4];
    for( a = 0; a < 5 ;a++)
     printf("%d ", a);
    return 0;
}

output:

include
stdio
h
main
a123
a
b123
c
ab
bc
ca
ds
a
a
a
printf
d
a
statements = 14 ids = 18 assign = 1 rel = 3 keywords = 11 integers = 7

I am printing the identifiers on the go. #include<stdio.h> is being counted as identifier. How do I avoid this?

I have tried '#'/[a-zA-Z0-9]* {;} rule:action pair but it is still being counted as identifier. How is the file being tokenized?

Also, the %d string in printf is being counted as an identifier. I have explicitly written that identifiers should only begin with letters, so why is %d being treated as an identifier?

Solution

I have tried '#'/[a-zA-Z0-9]* {;} rule:action pair but it [include] is still being counted as identifier. How is the file being tokenized?

Tokens are recognized one at a time. Each token starts where the previous token finished.
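
To watch this happen, here is a minimal tracing sketch. It keeps the patterns from the question but replaces each counter with a printf that names the rule that fired. The labels, the KEYWORD definition name and the quoting of the operators are my own choices for illustration, not part of the original program.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
KEYWORD int|char|bool|float|void|for|do|while|if|else|return
%%
\n                       { printf("newline\n"); }
{KEYWORD}                { printf("keyword     %s\n", yytext); }
"<="|">="|"=="|"<"|">"   { printf("relational  %s\n", yytext); }
'#'/[a-zA-Z0-9]*         { printf("hash rule   %s\n", yytext); }
[a-zA-Z][a-zA-Z0-9]*     { printf("identifier  %s\n", yytext); }
"="                      { printf("assign      %s\n", yytext); }
[0-9]+                   { printf("integer     %s\n", yytext); }
.                        { printf("fallback    %s\n", yytext); }
%%
int main(void)
{
    yylex();   /* scans standard input until end of file */
    return 0;
}
```

Built with something like flex trace.l && cc lex.yy.c -o trace (the file names are placeholders) and fed the line #include<stdio.h>, it should list # and . under fallback, < and > under relational, and include, stdio and h under identifier. The hash rule never fires.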

'#'/[a-zA-Z0-9]* matches the three characters '#' (apostrophe, #, apostrophe), provided they are followed by [a-zA-Z0-9]*; in flex, apostrophes are ordinary characters, not quotes. You probably meant "#"/[a-zA-Z0-9]* (with double quotes), which would match a #, again provided it is followed by zero or more letters or digits. Note that only the # is matched; the pattern after the / is "trailing context", which is basically a lookahead assertion. In this case, the lookahead is pointless because [a-zA-Z0-9]* can match the empty string, so any # would be matched. In any event, after the # is matched as a token, the scan continues at the next character, so the next token would be include.

Because of the typo, that pattern does not match. (There are no apostrophes in the source.) So what actually matches is your "fallback" rule: the rule whose pattern is a single dot (.). We call this a fallback rule because it matches anything. Strictly, it should be .|\n, since . matches anything but a newline, but as long as you have some other rule which matches a newline character, a bare . is acceptable. If you don't supply a fallback rule, flex inserts one automatically, with the action ECHO.

Thus, the # is ignored (just as it would have been if you'd written the rule as intended) and again the scan continues with the token include.
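
For comparison, here is a small sketch of the double-quoted version. It shows that trailing context is looked at but not consumed. The messages and the cast of yyleng are my own additions for illustration.

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%%
"#"/[a-zA-Z0-9]*        { printf("hash, yyleng = %d\n", (int)yyleng); }
[a-zA-Z][a-zA-Z0-9]*    { printf("word: %s\n", yytext); }
.|\n                    { /* fallback: ignore everything else */ }
%%
int main(void)
{
    yylex();   /* scans standard input until end of file */
    return 0;
}
```

Given the input #include, this should report a yyleng of 1 for the hash and then report include as a separate word. The lookahead did not stop include from being scanned as the next token.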

If you wanted to ignore the entire preprocessor directive, you could do something like:

^[[:blank:]]*#.* { ; }

(from a comment) I am getting stdio and h as keywords, how does that fit the definition that I have given? What happened to the . in between?

After the < is matched by the relational-operator rule (it is counted in rel, which is why rel is 3 rather than 1), stdio is matched. Since [a-zA-Z]+[a-zA-Z0-9]* doesn't match anything other than letters and digits, the . is not considered part of the token. Then the . is matched and ignored by the fallback rule, and then h is matched.

Also the %d string in printf is being counted as an identifier.

Not really. The % is explicitly ignored by the fallback rule (as was the ") and then the d is matched as an identifier. If you want to ignore words in string literals, you will have to recognise and ignore string literals.
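
One way to do that is sketched below. It assumes string literals follow the usual C rules: double-quoted, backslash escapes allowed, no raw newline inside. The string-literal pattern is a common idiom rather than anything from the question, and the directive rule here allows optional leading blanks. Only the identifier counter is kept, to keep the sketch short.

```lex
%{
#include <stdio.h>
int ids = 0;
%}
%option noyywrap
KEYWORD int|char|bool|float|void|for|do|while|if|else|return
%%
^[[:blank:]]*#.*          { /* skip a whole preprocessor directive */ }
\"([^"\\\n]|\\.)*\"       { /* skip a string literal, escapes included */ }
{KEYWORD}                 { /* keyword: not an identifier */ }
[a-zA-Z][a-zA-Z0-9]*      { printf("%s\n", yytext); ids++; }
.|\n                      { /* fallback: ignore everything else */ }
%%
int main(void)
{
    yylex();   /* scans standard input until end of file */
    printf("ids = %d\n", ids);
    return 0;
}
```

Because flex prefers the longest match, the directive and string rules win over the identifier rule whenever they apply. On the question's input file, include, stdio, h and d should no longer appear in the list. Comments and character constants such as '"' are not handled here and would need similar rules.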