I need to write Lex/Flex code to recognize identifiers. Here, an identifier is defined as follows: it can contain any combination of lowercase letters and digits, must start with a letter or an underscore, and only its first 20 characters are kept; the rest are dropped.
My problem is how to keep only the first 20 characters and drop the rest. Sample input:
_sdfasdfjh89234792jashdf
89ajshdf
Required Output:
_sdfasdfjh89234792ja is an Identifier
89ajshdf is a normal string
After many tries, I came up with the solution below, but it does not produce the required output. Output I get:
_sdfasdfjh89234792jashdf is an Identifier
89ajshdf is a normal string
My solution code:
%{
%}
%%
([a-z]|_)[_a-z0-9]{1,19} {printf("%s is an identifier\n",yytext);}
.* {printf("%s is normal string\n",yytext);} /* we will use ctrl+d to exit*/
%%
int yywrap(){ return 1; } /* 1 tells the scanner there is no more input */
int main(){
yylex();
return 0;
}
The problem statement asks you to recognise identifiers and then use the first 20 characters of each one. That's quite different from accepting up to 20 characters as an identifier token, which is what your code is trying to do. After you scan the 20 characters, the rest of the identifier is still in the input stream, and the next scan will pick it up as a second token, which is not desired. So you need to get rid of the bounded repetition operator {1,19}.
Once you have the token in yytext, you need to truncate it in the action. That's simple C string manipulation. The only (f)lex feature relevant here is that it sets the global yyleng to the length of the token (which is in yytext).
yytext is an internal temporary buffer, so if you want its contents to outlive the (f)lex action, you need to make a copy. But if all you want to do is print out at most 20 characters of the token, you can just use a length limitation in your printf format string:
[a-z_][a-z0-9_]* { printf("%.20s is an identifier.\n", yytext); }
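The effect of the precision in that format string can be checked outside of (f)lex. The following is a minimal sketch in plain C; the helper name format_identifier and the constant MAX_IDENT are made up for illustration, and snprintf is used instead of printf only so that the result lands in a buffer.

```c
#include <stdio.h>

#define MAX_IDENT 20  /* limit taken from the problem statement */

/* Hypothetical helper: writes at most MAX_IDENT characters of token,
   followed by a fixed message, into out. */
static int format_identifier(char *out, size_t outsize, const char *token)
{
    /* A precision on %s caps how many characters are taken from the
       string argument; "%.*s" reads that cap from an int argument,
       so it behaves like "%.20s" here. */
    return snprintf(out, outsize, "%.*s is an identifier", MAX_IDENT, token);
}
```

Called with the 24-character sample token from the question, this leaves only its first 20 characters in the buffer; a shorter token is copied unchanged.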
You will also need to change your second rule, since .* will match to the end of the current line. Unless the identifier is at the exact end of the line, .* will produce a longer match and the identifier rule will not be used. (F)lex always chooses the longest possible match; it only gives priority to the rule order in case two or more rules all produce the same longest match.
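To make the longest-match rule concrete, here is a rough hand-written simulation in plain C. It is not how (f)lex is implemented; the function names are invented for this sketch, it assumes an ASCII-style character set where a to z are contiguous, and it ignores the case where neither rule matches anything.

```c
#include <stddef.h>

/* Length of the longest prefix of s matching [a-z_][a-z0-9_]*, or 0. */
static size_t match_identifier(const char *s)
{
    size_t n;
    if (!((s[0] >= 'a' && s[0] <= 'z') || s[0] == '_'))
        return 0;
    n = 1;
    while ((s[n] >= 'a' && s[n] <= 'z') ||
           (s[n] >= '0' && s[n] <= '9') || s[n] == '_')
        n++;
    return n;
}

/* Length of the longest prefix matching .* (everything before a newline). */
static size_t match_rest_of_line(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != '\n')
        n++;
    return n;
}

/* Returns 1 if the identifier rule (listed first) would be chosen,
   2 if the .* rule would be chosen. */
static int winning_rule(const char *s)
{
    size_t ident = match_identifier(s);
    size_t rest = match_rest_of_line(s);
    /* The longest match wins; on a tie the earlier rule wins. */
    return (ident > 0 && ident >= rest) ? 1 : 2;
}
```

For a line containing only abc, both rules match three characters and the identifier rule wins because it comes first. For a line such as abc def, the .* rule matches seven characters and takes over, which is the behaviour described above.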
If you did want to return the string value, you would want to make a copy of up to 20 characters. The easiest way to do that is with the strndup function:
yylval = strndup(yytext, 20); /* This is a Posix function, so it's not in all C libraries. */
If you don't have strndup, you'll have to make the copy yourself, in which case yyleng comes in handy:
if (yyleng > 20) yyleng = 20;
yylval = malloc(yyleng + 1);
memcpy(yylval, yytext, yyleng);
yylval[yyleng] = '\0';
You need to check the value returned by strndup or malloc to ensure that it is not NULL. NULL would indicate an out-of-memory error. You also need to declare yylval somewhere; that will be automatic if you are using yacc/bison to parse, but you'll need to tell yacc/bison that yylval is a char* and not the default int. And don't forget to free the allocated string when you no longer need it.
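The manual copy and the NULL check can be folded into one small function. This is only a sketch: copy_truncated is a hypothetical helper name, and how an out-of-memory result should be handled is left to the caller.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, NUL-terminated copy of
   at most max_len characters of text, where len is the length of text.
   Returns NULL if the allocation fails. The caller must free the result. */
static char *copy_truncated(const char *text, size_t len, size_t max_len)
{
    char *copy;

    if (len > max_len)
        len = max_len;          /* keep only the first max_len characters */

    copy = malloc(len + 1);     /* one extra byte for the terminator */
    if (copy == NULL)
        return NULL;            /* out of memory: let the caller decide */

    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

In a (f)lex action this might be called as yylval = copy_truncated(yytext, (size_t)yyleng, 20); followed by a check for NULL, and the string would be released with free once the parser is done with it.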
yyleng makes this slightly more efficient, but if you were in some other coding environment you could just use strnlen to compute the bounded string length:
size_t leng = strnlen(yytext, 20);
yylval = malloc(leng + 1);
memcpy(yylval, yytext, leng);
yylval[leng] = '\0';
Don't use strlen and then test. strlen has to count to the end of the string no matter how long it is, and you don't care what the precise long count is. strnlen stops counting when it reaches the limit, which avoids that extra work. It's unlikely to make a big difference in a scanner, but it's a good habit to get into for the cases where it is a big win.