
How to pick only first N number of characters and drop remaining while matching pattern in LEX/FLEX


I need to write Lex/Flex code to recognize Identifiers. Here, Identifiers is defined as -

Identifiers- It can have alpha (lowercase) numeric combination, start with alphabet or underscore and only first 20 characters to be picked and remaining to be dropped.

My problem is how to pick only first 20 characters and drop remaining. Sample input:

_sdfasdfjh89234792jashdf

89ajshdf

Required Output:

_sdfasdfjh89234792ja is an Identifier

89ajshdf is a normal string

After many tries, I came up with below solution, but that's not the required output. Output I get:

 _sdfasdfjh89234792jashdf is an Identifier
 89ajshdf is a normal string

My solution code:

%{
#include <stdio.h>
%}

%%
([a-z]|_)[_a-z0-9]{1,19} {printf("%s is an identifier\n",yytext);}
.* {printf("%s is normal string\n",yytext);} /* we will use ctrl+d to exit*/
%%

int yywrap(void){ return 1; } /* 1 means there is no further input */
int main(){
yylex();
return 0;
}

Solution

  • The problem statement asks you to recognise identifiers and then use the first 20 characters of each one. That's quite different from accepting at most 20 characters as an identifier token, which is what your code tries to do. If the scanner stops after 20 characters, the rest of the identifier is still in the input stream, and the next scan will pick it up as a second token, which is not what you want. So you need to get rid of the bounded repetition operator {1,19}.

    Once you have the token in yytext, you need to truncate it in the action. That's simple C string manipulation. The only (f)lex feature useful here is that it sets the global yyleng to the length of the token in yytext.

    yytext is an internal temporary buffer, so if you want its contents to outlive the (f)lex action, you need to make a copy. But if all you want to do is print out at most 20 characters of the token, you can just use a length limitation in your printf format string:

    [a-z_][a-z0-9_]*   { printf("%.20s is an identifier.\n", yytext); }
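To see the precision trick outside a scanner, here is a minimal C sketch. The function name format_identifier and its max_len parameter are made up for illustration; it uses %.*s, the variant of %.20s that takes the limit as an int argument at run time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes "<prefix> is an identifier" into out, where
 * <prefix> is at most max_len characters of token.
 * Returns what snprintf returns: the length of the full formatted text. */
int format_identifier(char *out, size_t out_size, const char *token, int max_len)
{
    assert(out != NULL && token != NULL);
    /* The precision (.*) caps how many characters of token are written;
     * a shorter token is written whole. */
    return snprintf(out, out_size, "%.*s is an identifier", max_len, token);
}
```

Called with the sample token _sdfasdfjh89234792jashdf and a limit of 20, the buffer ends up holding _sdfasdfjh89234792ja is an identifier, while a shorter token such as abc is copied unchanged.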
    

    You will also need to change your second rule, since .* will match to the end of the current line. Unless the identifier is at the exact end of the line, .* will produce a longer match and the identifier rule will not be used. (F)lex always chooses the longest possible match; it only gives priority to the rule order in case two or more rules all produce the same longest match.
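The longest-match rule can be made concrete with a small C simulation. This is only a sketch, not how flex works internally (flex compiles all rules into a single automaton). The matcher functions and pick_rule are hypothetical names, hand-coded for ASCII input, and [^ \t\n]+ is just one possible replacement for .*, assumed here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the prefix of s matching [a-z_][a-z0-9_]* (0 if none). */
size_t match_identifier(const char *s)
{
    size_t n;
    if (!((s[0] >= 'a' && s[0] <= 'z') || s[0] == '_'))
        return 0;
    n = 1;
    while ((s[n] >= 'a' && s[n] <= 'z') ||
           (s[n] >= '0' && s[n] <= '9') || s[n] == '_')
        n++;
    return n;
}

/* Length of the prefix matching .* (everything up to a newline). */
size_t match_rest_of_line(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != '\n')
        n++;
    return n;
}

/* Length of the prefix matching [^ \t\n]+ (0 if none). */
size_t match_non_blank(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != ' ' && s[n] != '\t' && s[n] != '\n')
        n++;
    return n;
}

/* Index of the winning rule: the longest match wins, and the earliest
 * rule wins a tie. Returns -1 if no rule matched anything. */
int pick_rule(const size_t lens[], int count)
{
    int best = -1;
    size_t best_len = 0;
    assert(lens != NULL);
    for (int i = 0; i < count; i++) {
        if (lens[i] > best_len) {   /* strict '>' keeps the earlier rule on ties */
            best = i;
            best_len = lens[i];
        }
    }
    return best;
}
```

For the input abc def, the identifier rule matches 3 characters and .* matches 7, so pick_rule selects the second rule. With [^ \t\n]+ in second place both rules match 3 characters, and the identifier rule wins because it comes first.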

    If you did want to return the string value, you would want to make a copy of up to 20 characters. The easiest way to do that is with the strndup function:

    yylval = strndup(yytext, 20); /* This is a POSIX function, so it's not in all C libraries. */
    

    If you don't have strndup, you'll have to make the copy yourself, in which case yyleng comes in handy:

    if (yyleng > 20) yyleng = 20;
    yylval = malloc(yyleng + 1);
    memcpy(yylval, yytext, yyleng);
    yylval[yyleng] = '\0';
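The manual copy can also be wrapped in a small helper that includes the NULL check. This is a sketch under assumptions: copy_prefix is a hypothetical name, and the caller is expected to pass the token length (yyleng in a scanner action) and to free the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, NUL-terminated copy of
 * at most max_len bytes of text, whose length is len.
 * Returns NULL if malloc fails. The caller must free() the result. */
char *copy_prefix(const char *text, size_t len, size_t max_len)
{
    char *copy;
    assert(text != NULL);
    if (len > max_len)
        len = max_len;          /* truncate to the limit */
    copy = malloc(len + 1);     /* +1 for the terminating NUL */
    if (copy == NULL)
        return NULL;            /* out of memory */
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

In an action this might be called as yylval = copy_prefix(yytext, yyleng, 20);, assuming yylval has been declared as a char *. Unlike the inline version, it leaves yyleng itself untouched.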
    

    Notes

    1. You need to check the value returned by strndup or malloc to ensure that it is not NULL. NULL would indicate an out-of-memory error. You also need to declare yylval somewhere; that will be automatic if you are using yacc/bison to parse, but you'll need to tell yacc/bison that yylval is a char* and not the default int. And don't forget to free the allocated string when you no longer need it.

    2. yyleng makes this slightly more efficient but if you were in some other coding environment you could just use strnlen to compute the bounded string length:

       size_t leng = strnlen(yytext, 20);
       yylval = malloc(leng + 1);
       memcpy(yylval, yytext, leng);
       yylval[leng] = '\0';
      

      Don't use strlen and then test. strlen has to count to the end of the string no matter how long it is, and you don't care what the precise long count is. strnlen stops counting when it reaches the limit, which avoids that extra work. It's unlikely to make a big difference in a scanner, but it's a good habit to get into for the cases where it is a big win.
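Like strndup, strnlen comes from POSIX rather than older ISO C, so it may be missing from some C libraries. A minimal sketch of an equivalent built on memchr, which is standard C, could look like this. The name bounded_length is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strnlen: the length of s, but never more
 * than max. memchr stops at the first NUL it finds, so at most max
 * bytes are examined. */
size_t bounded_length(const char *s, size_t max)
{
    const char *end;
    assert(s != NULL);
    end = memchr(s, '\0', max);
    return end != NULL ? (size_t)(end - s) : max;
}
```

For a 24-character token and a limit of 20 this returns 20 without looking at the last four characters; for a short string such as abc it returns the true length, 3.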