flex lexer: Which variable should I update after changing yytext?


I am trying to write a simple compiler and am currently working on the scanner. For string tokens, I have the following rule in the flex file:

\"([^\\\n]|\\.)*\" { clean_string(); return TK_STRING; }

It works perfectly (this is not the question). The clean_string function is called to remove the leading and trailing " and to convert the escape sequences \n and \t to their corresponding ASCII characters.

int clean_string () {
  char * mystr;

  mystr=strdup(yytext+1) ; // copy yytext without the leading "
  if (! mystr) return 1;   // allocation failed
  mystr[yyleng-2]='\0'; // remove trailing "
  for (int i=0, j=0; i<=strlen(mystr); i++, j++) { // "<=" and not "<" so the terminating \0 is copied too; i: index into mystr, j: index into yytext
    if (mystr[i]=='\\') {
      i++;
      if (mystr[i]=='n')        yytext[j]='\n';
      else if (mystr[i]=='t')   yytext[j]='\t';
      else yytext[j]=mystr[i];
    }
    else yytext[j]=mystr[i];
  }
  yyleng=strlen(yytext);
  free(mystr);
  return 0 ;
}

It also works perfectly.

My question is the following:
At the end of the function, I update yyleng because yytext has changed. Is there any other variable I need to update to avoid unexpected behavior elsewhere in the program?


Solution

  • Unless you use yymore() in your action (and evidently, you do not), the flex-generated scanner does not require yyleng to reflect the length of yytext. You can modify yyleng in any way, and you can modify the contents of yytext between index 0 and index yyleng-1, including making the string shorter. What you must not do is lengthen yytext, since writing past its end would overwrite input that has not been scanned yet.

    Having said that, you need to be aware that the contents of yytext are only stable until the next time you call yylex. In almost all applications, particularly if you are planning on using the scanner from a parser with lookahead (such as a parser generated by yacc/bison), you will want the scanner to hand over a copy of the contents of yytext. In particular, yacc/bison-generated parsers expect to find the semantic value of a token (that is, the token string or some value derived from it) in some member of the union yylval, generally in the form of a pointer.

    So I'd strongly recommend that your function put the desired string contents into mystr and then return it (rather than freeing it immediately), and that the action place the pointer in a place where the parser can use it. That will require only a minor modification to your code and will make the scanner usable with a yacc/bison-generated parser.
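
As a rough illustration of that recommendation, here is a minimal sketch of a helper that builds the cleaned string in freshly allocated memory and returns it instead of writing back into yytext. The name unquote_string and its parameters are hypothetical; text and len stand in for yytext and yyleng, which lets the function be compiled and tested without flex. It assumes the token was matched by the rule in the question, so it starts and ends with a quote and every backslash is followed by another character.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of flex): returns a newly allocated,
 * unescaped copy of a quoted token, or NULL if allocation fails.
 * The caller owns the result and must free() it eventually. */
char *unquote_string(const char *text, size_t len) {
  assert(len >= 2);              /* the rule guarantees both quotes */
  char *out = malloc(len - 1);   /* at most len-2 content bytes, plus NUL */
  if (!out) return NULL;

  size_t j = 0;
  for (size_t i = 1; i + 1 < len; i++) {  /* skip leading and trailing " */
    char c = text[i];
    if (c == '\\' && i + 2 < len) {       /* escape: use the next character */
      c = text[++i];
      if (c == 'n') c = '\n';
      else if (c == 't') c = '\t';
      /* any other escaped character stands for itself */
    }
    out[j++] = c;
  }
  out[j] = '\0';
  return out;
}
```

The action would then look something like { yylval.str = unquote_string(yytext, yyleng); return TK_STRING; }, where str is an assumed char * member of the parser's %union. The parser (or whatever ends up holding the pointer) becomes responsible for freeing the string, and the action should decide what to do if NULL comes back.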