Search code examples
ccompiler-constructionflex-lexer

string declaration in action corresponding to a pattern affects next action corresponding to pattern


I extract the name of tag from input file on matching with pattern "<"([a-z][a-z]*), subsequently extracting substring that contains only the name part in a string (character array) tagName, using the function strncpy.

Most seems to work well, when first <div is matched, except that the length of tagName which is the extracted substring obtained in strncpy(tagName, yytext + 1, yytextlen - 1); should be 3 for div, but get 6 instead.

During second match of input <a, the value of tagName during first match persists here and affects the output in action.

I am unable to understand the reason for this behaviour.

Expected output:

tagName: 0\y�� /* Garbage value */, tagName length: 6 /* Why is this 6 */
An html tag { yytext: <a, yyleng: 2, text: a, len: 1 }

Current output: Please see the output section below.

I am using flex version 2.6.0 and gcc compiler on Ubuntu 16.


lex file:

HTML_TAG    [a-z][a-z]*                                                                                              
REACT_TAG   [A-Z][[:alnum:]]*

%%

"<"{HTML_TAG}   {
  int yytextlen = yyleng;
  char tagName[yytextlen - 1];
  printf("tagName: %s, tagName length: %lu\n", tagName, strlen(tagName));
  strncpy(tagName, yytext + 1, yytextlen - 1);                                                                       
  printf("An html tag { yytext: %s, yyleng: %zu, text: %s, len: %lu }\n", yytext, yyleng, tagName, strlen(tagName)); 
  printf("\n\n");
} 

"<"{REACT_TAG}  {
                      printf("A react tag { text: %s, len: %zu }\n", yytext, yyleng);
                      printf("\n\n");
}                     

"</"{HTML_TAG}  {                                                                                                    
                      printf("A closing html tag { text: %s, len: %zu }\n", yytext, yyleng);                        
                      printf("\n\n");                                                                                
}                     

"</"{REACT_TAG} {                                                                                                    
                      printf("A closing react tag { text: %s, len: %zu }\n", yytext, yyleng);                       
                      printf("\n\n");                                                                                
}                     

[ \t\n] /* eat up whitespace */                                                                                      

.   ;

%%  

int main(int argc, char **argv)                                                                                      
{
  ++argv, --argc; /* skip over program name */                                                                       
  if (argc > 0)
    yyin = fopen(argv[0], "r");
  else
    yyin = stdin;

  yylex(); 
} 

Input file:

<div
   attribute1=""
   attribute2=""
   attribute3=""
>
    <a>
        Text Content
    </a>
    <ReactTag attribute4="" />
    <ReactTag2 attribute5="">
        Text Content
    </ReactTag2>
</div>

Output:

tagName: 0\y��, tagName length: 6
An html tag { yytext: <div, yyleng: 4, text: div��, len: 6 }

tagName: div��, tagName length: 6
An html tag { yytext: <a, yyleng: 2, text: aiv��, len: 6 }

An closing html tag { text: </a, len: 3 }


A react tag { text: <ReactTag, len: 9 }


A react tag { text: <ReactTag2, len: 10 }


An closing react tag { text: </ReactTag2, len: 11 }


An closing html tag { text: </div, len: 5 }

Solution

  • I only looked at the action you indicated. If you make these same errors elsewhere, I'm sure you'll be able to find them.

    Here's the "<"{HTML_TAG} action, with my commentaries:

    {
      int yytextlen = yyleng;
    

    What's the point of this variable? yyleng is not going to change its value during the execution of this action. Just use it.

      char tagName[yytextlen - 1];
    

    You want to save a tagname whose with yyleng - 1 characters (since yyleng includes the <.) That means you need a temporary string whose size is yyleng - 1 + 1 (or yyleng for short), because you need to NUL-terminate the copy. Except you don't really need this copy. But we'll get back to that later.

      printf("tagName: %s, tagName length: %lu\n", tagName, strlen(tagName));
    

    I know you're planning on copy yytext into tagName, but you haven't done it yet. So at this point it is uninitialised storage. Trying to print it is Undefined Behaviour. Trying to get its length with strlen is Undefined Behaviour. Obviously, this is going to print garbage. (Anyway, why do you need to compute strlen? You know exactly how long this string is going to be: yyleng - 1.)

      strncpy(tagName, yytext + 1, yytextlen - 1);
    

    At some point I'm going to give up on this argument, but this is a great illustration of why you shouldn't use strncpy (except in the rare use case for which it was designed: fixed length database fields which don't require NUL termination). People seem to think strncpy is "safer" that strcpy because it has an n in its name. But it's actually extremely unsafe, even less safe than strcpy, because it doesn't NUL-terminate the copy. As we've seen above, you didn't leave enough space for the NUL terminator either, so if you had used strcpy, which does NUL terminate, then it would have written the NUL terminator outside of the buffer. But if you had made the buffer big enough, strcpy would have been exactly correct.

    Also, in the case where the source string in a strncpy is shorter than the target, strncpy fills the rest of the target with NULs. All of it. This is almost always a waste of cycles (except in cases like this, where it writes no NULs at all and produces Undefined Behaviour).

    If you really want to make a copy of a string limited to a maximum length, use strndup (if your C library includes it, which most do these days). strndup copies the string with a length limit. And it NUL-terminates the copy. And it dynamically allocates exactly enough space for the copy. If you want a safe interface, that's the one to use.

    But why do you feel that you need to make a copy here? If you're planning to pass the token on to a parser, then you will indeed need a copy, but a local variable-length array is not the copy you will need, because the lifetime of the local array ends as soon as the action terminates, which is long before the copy could be used. If you did need a copy, you would need a dynamically-allocated copy. And that's precisely what strndup would have given you.

      printf("An html tag { yytext: %s, yyleng: %zu, text: %s, len: %lu }\n",
                            yytext, yyleng, tagName, strlen(tagName));
    

    So now you have made your copy. But you made it with a library function that doesn't NUL-terminate, so it's still Undefined Behaviour to use the copy as though it were a string. It would only be a string if it were NUL-terminated. And it is still Undefined Behaviour to pass the copy to strlen as though it were a string.

    On the other hand, printing out yytext and yyleng is just fine.

      printf("\n\n");
    }
    

    At this point, the action ends. tagName no longer exists. It's lifetime has come to an end. yytext is still OK, but not for long: as soon as the scanner starts looking for the next token (which is right away since your action doesn't return the token to its caller), it will take back control of yytext, modifying its contents in unpredictable ways. So if you had needed a copy of the string to return with the token type, you would have had to have already made a copy which was still alive.

    Hope that all helps.