I extract the name of tag from input file on matching with pattern "<"([a-z][a-z]*)
, subsequently extracting substring that contains only the name part in a string (character array) tagName
, using the function strncpy
.
Most seems to work well, when first <div
is matched, except that the length of tagName
which is the extracted substring obtained in strncpy(tagName, yytext + 1, yytextlen - 1);
should be 3 for div, but get 6 instead.
During second match of input <a
, the value of tagName
during first match persists here and affects the output in action.
I am unable to understand the reason for this behaviour.
Expected output:
tagName: 0\y�� /* Garbage value */, tagName length: 6 /* Why is this 6 */
An html tag { yytext: <a, yyleng: 2, text: a, len: 1 }
Current output: Please see the output section below.
I am using flex version 2.6.0 and gcc compiler on Ubuntu 16.
lex file:
HTML_TAG [a-z][a-z]*
REACT_TAG [A-Z][[:alnum:]]*
%%
"<"{HTML_TAG} {
int yytextlen = yyleng;
char tagName[yytextlen - 1];
printf("tagName: %s, tagName length: %lu\n", tagName, strlen(tagName));
strncpy(tagName, yytext + 1, yytextlen - 1);
printf("An html tag { yytext: %s, yyleng: %zu, text: %s, len: %lu }\n", yytext, yyleng, tagName, strlen(tagName));
printf("\n\n");
}
"<"{REACT_TAG} {
printf("A react tag { text: %s, len: %zu }\n", yytext, yyleng);
printf("\n\n");
}
"</"{HTML_TAG} {
printf("A closing html tag { text: %s, len: %zu }\n", yytext, yyleng);
printf("\n\n");
}
"</"{REACT_TAG} {
printf("A closing react tag { text: %s, len: %zu }\n", yytext, yyleng);
printf("\n\n");
}
[ \t\n] /* eat up whitespace */
. ;
%%
int main(int argc, char **argv)
{
++argv, --argc; /* skip over program name */
if (argc > 0)
yyin = fopen(argv[0], "r");
else
yyin = stdin;
yylex();
}
Input file:
<div
attribute1=""
attribute2=""
attribute3=""
>
<a>
Text Content
</a>
<ReactTag attribute4="" />
<ReactTag2 attribute5="">
Text Content
</ReactTag2>
</div>
Output:
tagName: 0\y��, tagName length: 6
An html tag { yytext: <div, yyleng: 4, text: div��, len: 6 }
tagName: div��, tagName length: 6
An html tag { yytext: <a, yyleng: 2, text: aiv��, len: 6 }
An closing html tag { text: </a, len: 3 }
A react tag { text: <ReactTag, len: 9 }
A react tag { text: <ReactTag2, len: 10 }
An closing react tag { text: </ReactTag2, len: 11 }
An closing html tag { text: </div, len: 5 }
I only looked at the action you indicated. If you make these same errors elsewhere, I'm sure you'll be able to find them.
Here's the "<"{HTML_TAG}
action, with my commentaries:
{
int yytextlen = yyleng;
What's the point of this variable? yyleng
is not going to change its value during the execution of this action. Just use it.
char tagName[yytextlen - 1];
You want to save a tagname whose with yyleng - 1
characters (since yyleng
includes the <
.) That means you need a temporary string whose size is yyleng - 1 + 1
(or yyleng
for short), because you need to NUL-terminate the copy. Except you don't really need this copy. But we'll get back to that later.
printf("tagName: %s, tagName length: %lu\n", tagName, strlen(tagName));
I know you're planning on copy yytext
into tagName
, but you haven't done it yet. So at this point it is uninitialised storage. Trying to print it is Undefined Behaviour. Trying to get its length with strlen
is Undefined Behaviour. Obviously, this is going to print garbage. (Anyway, why do you need to compute strlen
? You know exactly how long this string is going to be: yyleng - 1
.)
strncpy(tagName, yytext + 1, yytextlen - 1);
At some point I'm going to give up on this argument, but this is a great illustration of why you shouldn't use strncpy
(except in the rare use case for which it was designed: fixed length database fields which don't require NUL termination). People seem to think strncpy
is "safer" that strcpy
because it has an n
in its name. But it's actually extremely unsafe, even less safe than strcpy
, because it doesn't NUL-terminate the copy. As we've seen above, you didn't leave enough space for the NUL terminator either, so if you had used strcpy
, which does NUL terminate, then it would have written the NUL terminator outside of the buffer. But if you had made the buffer big enough, strcpy
would have been exactly correct.
Also, in the case where the source string in a strncpy
is shorter than the target, strncpy
fills the rest of the target with NULs. All of it. This is almost always a waste of cycles (except in cases like this, where it writes no NULs at all and produces Undefined Behaviour).
If you really want to make a copy of a string limited to a maximum length, use strndup
(if your C library includes it, which most do these days). strndup
copies the string with a length limit. And it NUL-terminates the copy. And it dynamically allocates exactly enough space for the copy. If you want a safe interface, that's the one to use.
But why do you feel that you need to make a copy here? If you're planning to pass the token on to a parser, then you will indeed need a copy, but a local variable-length array is not the copy you will need, because the lifetime of the local array ends as soon as the action terminates, which is long before the copy could be used. If you did need a copy, you would need a dynamically-allocated copy. And that's precisely what strndup
would have given you.
printf("An html tag { yytext: %s, yyleng: %zu, text: %s, len: %lu }\n",
yytext, yyleng, tagName, strlen(tagName));
So now you have made your copy. But you made it with a library function that doesn't NUL-terminate, so it's still Undefined Behaviour to use the copy as though it were a string. It would only be a string if it were NUL-terminated. And it is still Undefined Behaviour to pass the copy to strlen
as though it were a string.
On the other hand, printing out yytext
and yyleng
is just fine.
printf("\n\n");
}
At this point, the action ends. tagName
no longer exists. It's lifetime has come to an end. yytext
is still OK, but not for long: as soon as the scanner starts looking for the next token (which is right away since your action doesn't return
the token to its caller), it will take back control of yytext
, modifying its contents in unpredictable ways. So if you had needed a copy of the string to return with the token type, you would have had to have already made a copy which was still alive.
Hope that all helps.