Here is an XML start tag and end tag with "Hello, world" between them:
<foo>Hello, world</foo>
In XML there is something called a CDATA section. It has this unusual syntax:
<![CDATA[...]]>
A CDATA section is a wrapper around data.
If a start-tag/end-tag pair contains CDATA sections, then the content of the element is the concatenation of the data outside the CDATA sections with the data inside them. For example, the content of this foo element:
<foo>First expression <![CDATA[A < B]]>, second expression <![CDATA[C < D + 1]]>.</foo>
is this:
First expression A < B, second expression C < D + 1.
Question: How do I scoop up each piece within foo and concatenate the pieces together? That is, how do I scoop up these pieces:
"First expression "
"A < B"
", second expression "
"C < D + 1"
"."
and concatenate them together?
Below is a lexer I created. It works fine if foo doesn't have any CDATA sections, but when foo has a CDATA section the lexer hangs.
Notice that my lexer uses yyless() and yymore(); I am imitating the example at the bottom of page 137 of the book Flex & Bison. The lexer scoops up the characters before the CDATA section plus the CDATA start syntax, then pushes the CDATA start syntax back into the input and calls yymore(). Another rule discards the CDATA start syntax. I think this is not the right approach. What is the right way to accomplish this? Is there a way to solve this problem without using yyless() and yymore()?
%option noyywrap

%x ELEMENT_CONTENT

%{
enum yytokentype {
    TOK_START_TAG = 258,
    TOK_END_TAG = 259,
    TOK_ELEMENT_CONTENT = 260
};
%}

%%

<INITIAL>{
    "<foo>"              { BEGIN(ELEMENT_CONTENT); return(TOK_START_TAG); }
    "</foo>"             { return(TOK_END_TAG); }
}

<ELEMENT_CONTENT>{
    [^<]+"<![CDATA["     { yyless(9); yymore(); }
    "<![CDATA["          { /* ignore CDATA start syntax */ }
    [^\]]+"]]>"          { yyless(3); yymore(); }
    "]]>"                { /* ignore CDATA end syntax */ }
    [^<]*                { BEGIN(INITIAL); return TOK_ELEMENT_CONTENT; }
}

%%
int main(int argc, char *argv[])
{
    printf("In the lexer\n");
    yyin = fopen(argv[1], "r");

    int tok;
    while ((tok = yylex())) {
        switch (tok) {
        case 258:
            printf("TOK_START_TAG: %s\n", yytext);
            break;
        case 259:
            printf("TOK_END_TAG: %s\n", yytext);
            break;
        case 260:
            printf("TOK_ELEMENT_CONTENT: %s\n", yytext);
            break;
        default:
            printf("unexpected: %s\n", yytext);
        }
    }
    fclose(yyin);
    return 0;
}
The text inside an XML element can be interrupted by a number of things, not just CDATA sections. It could contain entity references, or numeric character references. It could contain comments, which will mostly be ignored, or it could contain content-less tags which are not of interest to the parser. And so on. A lexer which eliminated those things might be useful in a particular context, or it might create a problem which the lexer's client will end up tearing their hair out trying to solve. On the whole, these are issues the lexer should probably not attempt to solve; it's better design to just document the fact that text might be lexed as several consecutive TEXT tokens (or several consecutive "text-like" tokens). That's certainly the way I would do it.
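Concretely, that means you can drop yyless() and yymore() entirely and just return one token per piece. Here is an untested sketch of that idea; it assumes an extra exclusive start condition, which I've called CDATA_SECTION (my name, not anything standard):

%option noyywrap

%x ELEMENT_CONTENT CDATA_SECTION

%{
/* Sketch only: CDATA content is just another run of "text-like" tokens. */
enum yytokentype {
    TOK_START_TAG = 258,
    TOK_END_TAG = 259,
    TOK_ELEMENT_CONTENT = 260
};
%}

%%

<INITIAL>{
    "<foo>"       { BEGIN(ELEMENT_CONTENT); return TOK_START_TAG; }
}

<ELEMENT_CONTENT>{
    "</foo>"      { BEGIN(INITIAL); return TOK_END_TAG; }
    "<![CDATA["   { BEGIN(CDATA_SECTION); /* markup only, no token */ }
    [^<]+         { return TOK_ELEMENT_CONTENT; /* one text piece */ }
}

<CDATA_SECTION>{
    "]]>"         { BEGIN(ELEMENT_CONTENT); /* markup only, no token */ }
    [^\]]+        { return TOK_ELEMENT_CONTENT; /* CDATA text piece */ }
    "]"           { return TOK_ELEMENT_CONTENT; /* ']' that doesn't end the section */ }
}

%%

With rules like these, the input from the question yields TOK_START_TAG, then five consecutive TOK_ELEMENT_CONTENT tokens ("First expression ", "A < B", ", second expression ", "C < D + 1", "."), then TOK_END_TAG; the concatenation is left to the caller. (Other markup inside foo isn't handled here, in keeping with the scope of the original lexer.)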
I understand that you're trying to take advantage of Flex's internal buffer to do the token concatenation in place, avoiding extra memory allocation and copying. That's a tempting optimisation, but it drops you into a twisty maze of details and corner cases. Furthermore, Flex is designed around the idea that tokens are "not too long"; extremely long tokens can trigger inefficiencies in the Flex algorithm.
There are well-known techniques for optimising string assembly, and you'd probably be better off using one of them (or a library which implements efficient string concatenation) and leaving Flex's internals to Flex. :-)
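For instance, the driver could do the assembly itself with a growable buffer. An untested sketch of what the user-code section of the .l file (after the second %%) might look like; append_text is my own helper, not part of Flex:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Growable buffer for assembling consecutive TOK_ELEMENT_CONTENT tokens. */
static char *buf = NULL;
static size_t len = 0, cap = 0;

static void append_text(const char *s, size_t n)
{
    if (len + n + 1 > cap) {                      /* grow geometrically */
        size_t newcap = cap ? cap * 2 : 64;
        while (newcap < len + n + 1)
            newcap *= 2;
        char *p = realloc(buf, newcap);
        if (!p) { perror("realloc"); exit(1); }
        buf = p;
        cap = newcap;
    }
    memcpy(buf + len, s, n);
    len += n;
    buf[len] = '\0';
}

int main(int argc, char *argv[])
{
    if (argc > 1)
        yyin = fopen(argv[1], "r");

    int tok;
    while ((tok = yylex())) {
        if (tok == TOK_ELEMENT_CONTENT) {
            append_text(yytext, (size_t)yyleng);  /* keep accumulating pieces */
        } else if (len > 0) {
            printf("element content: %s\n", buf); /* flush on any other token */
            len = 0;
        }
    }
    return 0;
}

The same idea works with any string-builder library; the point is that the concatenation happens in the lexer's client, not inside Flex's input buffer.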