I'm implementing a custom parser generator with an embedded lexer and parser to parse HTTP headers as an event-driven state machine. Here are some definitions the eventual parser generator could consume to parse a single header field without CRLF at the end:
token host<prio=1> = "[Hh][Oo][Ss][Tt]" ;
token ospace = "[ \t]*" ;
token htoken = "[-!#$%&'*+.^_`|~0-9A-Za-z]+" ;
token hfield = "[\t\x20-\x7E\x80-\xFF]*" ;
token space = " " ;
token htab = "\t" ;
token colon = ":" ;
obsFoldStart = 1*( space | htab ) ;
hdrField =
obsFoldStart hfield
| host colon ospace hfield<print>
| htoken colon ospace hfield
;
The lexer is based on a maximal munch rule and the tokens are dynamically turned on and off depending on the context, so there is no conflict between htoken and hfield, and the priority value resolves the conflict between host and htoken. I'm planning to implement the parser as an LL(1) table parser. I haven't yet decided if I'll implement regexp token matching by simulating the nondeterministic finite automaton or go all the way to exploding it to a deterministic finite automaton.
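To make the lexer rule concrete, here is a minimal sketch of the selection step: among the tokens that are currently enabled, the longest match wins, and the priority value breaks ties. TokenDef, pickToken, headerNameTokens and the two hand-written matchers are hypothetical names made up for this illustration; the real generator would match with automata built from the regular expressions instead.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical token description: name, priority (higher wins ties),
// an "enabled" flag set by the parser, and a matcher that returns the
// length of the match at in[pos] (0 means "no match").
struct TokenDef {
    const char *name;
    int prio;
    bool enabled;
    std::size_t (*match)(const std::string &in, std::size_t pos);
};

// Stand-in for "[Hh][Oo][Ss][Tt]".
static std::size_t matchHost(const std::string &in, std::size_t pos) {
    static const char word[] = "host";
    for (std::size_t i = 0; i < 4; ++i) {
        if (pos + i >= in.size()) return 0;
        unsigned char c = static_cast<unsigned char>(in[pos + i]);
        if (std::tolower(c) != word[i]) return 0;
    }
    return 4;
}

// Stand-in for "[-!#$%&'*+.^_`|~0-9A-Za-z]+".
static std::size_t matchHtoken(const std::string &in, std::size_t pos) {
    static const char extra[] = "-!#$%&'*+.^_`|~";
    std::size_t n = 0;
    while (pos + n < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[pos + n]);
        bool ok = std::isalnum(c) || (c != '\0' && std::strchr(extra, c));
        if (!ok) break;
        ++n;
    }
    return n;
}

// Assumed token set for the start of a header field.
static std::vector<TokenDef> headerNameTokens() {
    return { {"host", 1, true, matchHost},
             {"htoken", 0, true, matchHtoken} };
}

// Maximal munch: returns the winning token (or nullptr if nothing matches)
// and stores the match length in len. Empty matches are ignored here.
static const TokenDef *pickToken(const std::vector<TokenDef> &defs,
                                 const std::string &in, std::size_t pos,
                                 std::size_t &len) {
    assert(pos <= in.size());
    const TokenDef *best = nullptr;
    len = 0;
    for (const TokenDef &d : defs) {
        if (!d.enabled) continue;            // turned off in this context
        std::size_t n = d.match(in, pos);
        if (n == 0) continue;
        if (n > len || (n == len && best && d.prio > best->prio)) {
            best = &d;
            len = n;
        }
    }
    return best;
}
```

With both tokens enabled, pickToken on "Host: example.org" selects host (same length as htoken, higher priority), while "Hostname: x" selects htoken because its match is longer. Tokens that may match the empty string, such as ospace, would need separate treatment in this sketch.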
Now, I would like to include some C source code in my parser generator input:
hdrField =
obsFoldStart hfield
| host {
parserState->userdata.was_host = 1;
} colon ospace hfield<print>
| htoken {
parserState->userdata.was_host = 0;
} colon ospace hfield
;
What I need, then, is some way to read a text token that ends once as many } characters have been read as { characters.
How can I do this? I'm handling comments using BEGIN(COMMENTS) and BEGIN(INITIAL), but I don't believe such a strategy would work for embedded C source. Also, the comment handling could complicate the embedded C source code handling a lot, because I don't believe a single token can have a comment in the middle of it.
Basically, I need the embedded C language snippet as a C string I can store to my data structures.
So, I took some existing lex code and made it standalone.
I hope it's OK that I used C++ code, although I noticed that the question is tagged c only. IMHO, this concerns only the less relevant parts of this sample code. (Memory management in C is much more tedious than simply delegating it to std::string.)
scanC.l:
%{
#include <iostream>
#include <string>
#ifdef _WIN32
/// disables #include <unistd.h>
#define YY_NO_UNISTD_H
#endif // _WIN32
// buffer for collected C/C++ code
static std::string cCode;
// counter for braces
static int nBraces = 0;
%}
/* Options */
/* make never interactive (prevent usage of certain C functions) */
%option never-interactive
/* force lexer to process 8 bit ASCIIs (unsigned characters) */
%option 8bit
/* prevent usage of yywrap */
%option noyywrap
EOL ("\n"|"\r"|"\r\n")
SPC ([ \t]|"\\"{EOL})*
LITERAL "\""("\\".|[^\\"])*"\""
%s CODE
%%
<INITIAL>"{" { cCode = '{'; nBraces = 1; BEGIN(CODE); }
<INITIAL>. |
<INITIAL>{EOL} { std::cout << yytext; }
<INITIAL><<EOF>> { return 0; }
<CODE>"{" {
cCode += '{'; ++nBraces;
//updateFilePos(yytext, yyleng);
} break;
<CODE>"}" {
cCode += '}'; //updateFilePos(yytext, yyleng);
if (!--nBraces) {
BEGIN(INITIAL);
//return new Token(filePosCCode, Token::TkCCode, cCode.c_str());
std::cout << '\n'
<< "Embedded C code:\n"
<< cCode << "// End of embedded C code\n";
}
} break;
<CODE>"/*" { // C comments
cCode += "/*"; //_filePosCComment = _filePos;
//updateFilePos(yytext, yyleng);
char c1 = ' ';
do {
char c0 = c1; c1 = yyinput();
switch (c1) {
case '\r': break;
case '\n':
cCode += '\n'; //updateFilePos(&c1, 1);
break;
default:
if (c0 == '\r' && c1 != '\n') {
c0 = '\n'; cCode += '\n'; //updateFilePos(&c0, 1);
} else {
cCode += c1; //updateFilePos(&c1, 1);
}
}
if (c0 == '*' && c1 == '/') break;
} while (c1 != EOF);
if (c1 == EOF) {
//ErrorFile error(_filePosCComment, "'/*' without '*/'!");
//throw ErrorFilePrematureEOF(_filePos);
std::cerr << "ERROR! '/*' without '*/'!\n";
return -1;
}
} break;
<CODE>"//"[^\r\n]* | /* C++ one-line comments */
<CODE>"'"("\\".|[^\\'])+"'" | /*"/* C/C++ character constants */
<CODE>{LITERAL} | /* C/C++ string constants */
<CODE>"#"[^\r\n]* | /* preprocessor commands */
<CODE>[ \t]+ | /* non-empty white space */
<CODE>[^\r\n] { // any other character except EOL
cCode += yytext;
//updateFilePos(yytext, yyleng);
} break;
<CODE>{EOL} { // special handling for EOL
cCode += '\n';
//updateFilePos(yytext, yyleng);
} break;
<CODE><<EOF>> { // premature EOF
//ErrorFile error(_filePosCCode,
// compose("%1 '{' without '}'!", _nBraces));
//_errorManager.add(error);
//throw ErrorFilePrematureEOF(_filePos);
std::cerr << "ERROR! Premature end of input. (Not enough '}'s.)\n";
return -1; // an <<EOF>> action has to end the scan, else EOF is seen again
}
%%
int main(int argc, char **argv)
{
return yylex();
}
A sample text to scan, scanC.txt:
Hello juhist.
The text without braces doesn't need to have any syntax.
It just echoes the characters until it finds a block:
{ // the start of C code
// a C++ comment
/* a C comment
* (Remember that nested /*s are not supported.)
*/
#define MAX 1024
static char buffer[MAX] = "", empty="\"\"";
/* It is important that tokens are recognized to a limited amount.
* Otherwise, it would be too easy to fool the scanner with }}}
* where they have no meaning.
*/
char *theSameForStringConstants = "}}}";
char *andCharConstants = '}}}';
int main() { return yylex(); }
}
This code should be just copied
(with a remark that the scanner recognized the C code a such.)
Greetings, Scheff.
Compiled and tested on cygwin64:
$ flex --version
flex 2.6.4
$ flex -o scanC.cc scanC.l
$ g++ --version
g++ (GCC) 7.3.0
$ g++ -std=c++11 -o scanC scanC.cc
$ ./scanC < scanC.txt
Hello juhist.
The text without braces doesn't need to have any syntax.
It just echoes the characters until it finds a block:
Embedded C code:
{ // the start of C code
// a C++ comment
/* a C comment
* (Remember that nested /*s are not supported.)
*/
#define MAX 1024
static char buffer[MAX] = "", empty="\"\"";
/* It is important that tokens are recognized to a limited amount.
* Otherwise, it would be too easy to fool the scanner with }}}
* where they have no meaning.
*/
char *theSameForStringConstants = "}}}";
char *andCharConstants = '}}}';
int main() { return yylex(); }
}// End of embedded C code
This code should be just copied
(with a remark that the scanner recognized the C code a such.)
Greetings, Scheff.
$
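The rules in scanC.l boil down to counting braces while stepping over comments, string constants and character constants as a whole. For comparison, here is the same idea as a plain function without flex. This is only a sketch under assumptions: findBlockEnd is a hypothetical helper (not part of scanC.l), it works on input that is already in memory, and unlike scanC.l it does not treat preprocessor lines specially.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: in[pos] must be '{'. Returns the index one past the
// matching '}', or std::string::npos if the block is not closed. Braces
// inside comments, string constants and character constants do not count.
std::size_t findBlockEnd(const std::string &in, std::size_t pos) {
    const std::size_t n = in.size();
    if (pos >= n || in[pos] != '{') return std::string::npos;
    int depth = 0;
    std::size_t i = pos;
    while (i < n) {
        const char c = in[i];
        if (c == '/' && i + 1 < n && in[i + 1] == '*') {
            // C comment: skip to the closing "*/"
            const std::size_t end = in.find("*/", i + 2);
            if (end == std::string::npos) return std::string::npos;
            i = end + 2;
        } else if (c == '/' && i + 1 < n && in[i + 1] == '/') {
            // C++ one-line comment: skip to the end of the line
            const std::size_t end = in.find('\n', i);
            if (end == std::string::npos) return std::string::npos;
            i = end;
        } else if (c == '"' || c == '\'') {
            // string or character constant: skip to the closing quote
            const char quote = c;
            ++i;
            while (i < n && in[i] != quote) {
                if (in[i] == '\\') ++i;  // step over the escaped character
                ++i;
            }
            if (i >= n) return std::string::npos;
            ++i;  // closing quote
        } else {
            if (c == '{') {
                ++depth;
            } else if (c == '}' && --depth == 0) {
                return i + 1;
            }
            ++i;
        }
    }
    return std::string::npos;  // premature end of input
}
```

If end is the value returned for a block starting at pos, then in.substr(pos, end - pos) is the embedded snippet including both braces, and its c_str() gives the C string to store.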
Notes:
1. This is taken from a helper tool (not for selling). Hence, it is not bullet-proof but just good enough for productive code.
2. What I noticed when adapting it: the line continuation of preprocessor lines is not handled.
3. It's surely possible to fool the tool with a creative combination of macros with unbalanced { }, something we would never do in our productive code (see 1.).
So, it might be at least a start for further development.
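Regarding the missing line continuation: one possible way to close that gap is to let a backslash directly followed by an end of line extend the preprocessor line. The following is a minimal sketch of that logic as a plain function; findDirectiveEnd is a hypothetical name, and comments inside the directive are not considered.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: in[pos] is the '#' of a preprocessor line. Returns
// the index of the EOL character that ends the logical line (or in.size()
// if the input ends first). A backslash directly followed by "\n", "\r"
// or "\r\n" continues the line.
std::size_t findDirectiveEnd(const std::string &in, std::size_t pos) {
    const std::size_t n = in.size();
    std::size_t i = pos;
    while (i < n) {
        const char c = in[i];
        if (c == '\\' && i + 1 < n) {
            if (in[i + 1] == '\n') {
                i += 2;  // backslash + LF
                continue;
            }
            if (in[i + 1] == '\r') {
                // backslash + CR, optionally followed by LF
                i += (i + 2 < n && in[i + 2] == '\n') ? 3 : 2;
                continue;
            }
        }
        if (c == '\n' || c == '\r') break;  // real end of the directive
        ++i;
    }
    return i;
}
```

In scanC.l, the corresponding change would go into the rule that currently matches "#"[^\r\n]*.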
To check this against a C lex specification, I have the "ANSI C grammar, Lex specification" at hand, though it's 22 years old. (There are probably newer ones available that match the current standards.)