
How to create Yacc/Lex rules for embedding C source code snippets?


I'm implementing a custom parser generator with an embedded lexer and parser to parse HTTP headers as an event-driven state machine. Here are some definitions that the eventual parser generator could consume to parse a single header field without the CRLF at the end:

token host<prio=1> = "[Hh][Oo][Ss][Tt]" ;
token ospace = "[ \t]*" ;
token htoken = "[-!#$%&'*+.^_`|~0-9A-Za-z]+" ;
token hfield = "[\t\x20-\x7E\x80-\xFF]*" ;
token space = " " ;
token htab = "\t" ;
token colon = ":" ;

obsFoldStart = 1*( space | htab ) ;
hdrField =
  obsFoldStart hfield
| host colon ospace hfield<print>
| htoken colon ospace hfield
  ;

The lexer is based on a maximal munch rule, and the tokens are dynamically turned on and off depending on the context, so there is no conflict between htoken and hfield; the priority value resolves the conflict between host and htoken. I'm planning to implement the parser as an LL(1) table-driven parser. I haven't yet decided whether I'll implement regexp token matching by simulating the nondeterministic finite automaton or by converting it all the way to a deterministic finite automaton.
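
The maximal munch rule with priority tie-breaking and context-dependent token enabling described above could be sketched roughly as follows. This is only an illustration built on std::regex: the names TokenDef, Match, and nextToken are hypothetical, tokens without an explicit prio are assumed to have priority 0, and a real generator would compile the patterns to an NFA or DFA rather than call a regex library.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token definition; "enabled" models the context-dependent
// switching of tokens, "prio" the <prio=N> annotation (assumed 0 if absent).
struct TokenDef {
  std::string name;
  std::regex re;
  int prio;
  bool enabled;
};

// Result of one lexing step; name is empty if no enabled token matched.
struct Match {
  std::string name;
  std::size_t length;
};

// Maximal munch: the longest match among the enabled tokens wins;
// on equal length the higher priority wins.
Match nextToken(const std::vector<TokenDef> &defs,
                const std::string &input, std::size_t pos)
{
  assert(pos <= input.size());
  Match best{"", 0};
  int bestPrio = 0;
  for (const TokenDef &d : defs) {
    if (!d.enabled) continue;
    std::smatch m;
    // match_continuous anchors the match at input[pos]
    if (!std::regex_search(input.begin() + pos, input.end(), m, d.re,
                           std::regex_constants::match_continuous))
      continue;
    const std::size_t len = static_cast<std::size_t>(m.length(0));
    if (len == 0) continue; // empty matches are ignored in this sketch
    if (len > best.length || (len == best.length && d.prio > bestPrio)) {
      best = Match{d.name, len};
      bestPrio = d.prio;
    }
  }
  return best;
}
```

With host (prio 1) and htoken both enabled, the input "Host: example.org" yields host, because both patterns match four characters and the priority breaks the tie, while "Hostname: x" yields htoken, because its match is longer.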

Now, I would like to include some C source code in my parser generator input:

hdrField =
  obsFoldStart hfield
| host {
  parserState->userdata.was_host = 1;
} colon ospace hfield<print>
| htoken {
  parserState->userdata.was_host = 0;
} colon ospace hfield
  ;

What I need, then, is some way to read a text token that ends once as many } characters have been read as { characters.

How can I do this? I'm handling comments using BEGIN(COMMENTS) and BEGIN(INITIAL), but I don't believe such a strategy would work for embedded C source. Also, the comment handling could complicate the embedded C source handling a lot, because I don't believe a single token can have a comment in the middle of it.

Basically, I need the embedded C language snippet as a C string I can store to my data structures.


Solution

  • So, I took some of the generated lex code and made it self-contained.

    I hope it's OK that I used C++ code although the question is tagged c only. IMHO, this concerns only the less relevant parts of this sample code. (Memory management in C is much more tedious than simply delegating it to std::string.)

    scanC.l:

    %{
    
    #include <iostream>
    #include <string>
    
    #ifdef _WIN32
    /// disables #include <unistd.h>
    #define YY_NO_UNISTD_H
    #endif // _WIN32
    
    // buffer for collected C/C++ code
    static std::string cCode;
    // counter for braces
    static int nBraces = 0;
    
    %}
    
    /* Options */
    
    /* make never interactive (prevent usage of certain C functions) */
    %option never-interactive
    /* force lexer to process 8 bit ASCIIs (unsigned characters) */
    %option 8bit
    /* prevent usage of yywrap */
    %option noyywrap
    
    
    EOL ("\n"|"\r"|"\r\n")
    SPC ([ \t]|"\\"{EOL})*
    LITERAL "\""("\\".|[^\\"])*"\""
    
    %s CODE
    
    %%
    
    <INITIAL>"{" { cCode = '{'; nBraces = 1; BEGIN(CODE); }
    <INITIAL>. |
    <INITIAL>{EOL} { std::cout << yytext; }
    <INITIAL><<EOF>> { return 0; }
    
    <CODE>"{" {
      cCode += '{'; ++nBraces;
      //updateFilePos(yytext, yyleng);
    } break;
    <CODE>"}" {
      cCode += '}'; //updateFilePos(yytext, yyleng);
      if (!--nBraces) {
        BEGIN(INITIAL);
        //return new Token(filePosCCode, Token::TkCCode, cCode.c_str());
        std::cout << '\n'
          << "Embedded C code:\n"
          << cCode << "// End of embedded C code\n";
      }
    } break;
    
    <CODE>"/*" { // C comments
      cCode += "/*"; //_filePosCComment = _filePos;
      //updateFilePos(yytext, yyleng);
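      int c1 = ' '; // int rather than char, so the end-of-input value is not truncated
      do {
        int c0 = c1; c1 = yyinput();
        // end of input: EOF, or 0 in some flex releases
        if (c1 == EOF || c1 == 0) { c1 = EOF; break; }
        switch (c1) {
          case '\r': break;
          case '\n':
            cCode += '\n'; //updateFilePos(&c1, 1);
            break;
          default:
            if (c0 == '\r') { // lone CR: store it as a line break
              c0 = '\n'; cCode += '\n'; //updateFilePos(&c0, 1);
            }
            cCode += static_cast<char>(c1); //updateFilePos(&c1, 1);
        }
        if (c0 == '*' && c1 == '/') break;
      } while (true);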
      if (c1 == EOF) {
        //ErrorFile error(_filePosCComment, "'/*' without '*/'!");
        //throw ErrorFilePrematureEOF(_filePos);
        std::cerr << "ERROR! '/*' without '*/'!\n";
        return -1;
      }
    } break;
    <CODE>"//"[^\r\n]* | /* C++ one-line comments */
    <CODE>"'"("\\".|[^\\'])+"'" | /*"/* C/C++ character constants */
    <CODE>{LITERAL} | /* C/C++ string constants */
    <CODE>"#"[^\r\n]* | /* preprocessor commands */
    <CODE>[ \t]+ | /* non-empty white space */
    <CODE>[^\r\n] { // any other character except EOL
      cCode += yytext;
      //updateFilePos(yytext, yyleng);
    } break;
    <CODE>{EOL} { // special handling for EOL
      cCode += '\n';
      //updateFilePos(yytext, yyleng);
    } break;
    <CODE><<EOF>> { // premature EOF
      //ErrorFile error(_filePosCCode,
      //  compose("%1 '{' without '}'!", _nBraces));
      //_errorManager.add(error);
      //throw ErrorFilePrematureEOF(_filePos);
      std::cerr << "ERROR! Premature end of input. (Not enough '}'s.)\n";
      return -1; // an <<EOF>> action has to end the scan, or it is triggered again
    }
    
    %%
    
    int main(int argc, char **argv)
    {
      return yylex();
    }
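
    The brace counting itself does not depend on flex. As a rough illustration of the same idea in plain C++ (findBlockEnd and extractBlock are hypothetical helpers, not part of scanC.l), the following sketch returns the snippet as a std::string and skips braces inside comments, string literals, and character constants. Unlike the scanner above, it does not treat preprocessor lines specially and does not normalize line endings.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Hypothetical helper: text[open] must be '{'. Returns the index one past
    // the matching '}', or std::string::npos if the input ends before that.
    std::size_t findBlockEnd(const std::string &text, std::size_t open)
    {
      assert(open < text.size() && text[open] == '{');
      const std::size_t n = text.size();
      int depth = 0;
      std::size_t i = open;
      while (i < n) {
        const char c = text[i];
        if (c == '/' && i + 1 < n && text[i + 1] == '*') {        // C comment
          const std::size_t end = text.find("*/", i + 2);
          if (end == std::string::npos) return std::string::npos;
          i = end + 2;
        } else if (c == '/' && i + 1 < n && text[i + 1] == '/') { // C++ comment
          i = text.find('\n', i);
          if (i == std::string::npos) return std::string::npos;
        } else if (c == '"' || c == '\'') {           // string or char constant
          const char quote = c;
          ++i;
          while (i < n && text[i] != quote) {
            if (text[i] == '\\') ++i;                 // skip the escaped character
            ++i;
          }
          if (i >= n) return std::string::npos;       // unterminated literal
          ++i;                                        // closing quote
        } else {
          if (c == '{') ++depth;
          else if (c == '}' && --depth == 0) return i + 1;
          ++i;
        }
      }
      return std::string::npos;                       // not enough '}'
    }

    // Returns the block including its outer braces, or "" if it is unbalanced.
    std::string extractBlock(const std::string &text, std::size_t open)
    {
      const std::size_t end = findBlockEnd(text, open);
      return end == std::string::npos ? std::string()
                                      : text.substr(open, end - open);
    }
    ```

    The returned std::string can be stored as-is, or handed over as a C string via c_str() or a strdup() copy, depending on who owns the memory.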
    

    A sample text to scan, scanC.txt:

    Hello juhist.
    
    The text without braces doesn't need to have any syntax.
    It just echoes the characters until it finds a block:
    { // the start of C code
      // a C++ comment
      /* a C comment
       * (Remember that nested /*s are not supported.)
       */
      #define MAX 1024
      static char buffer[MAX] = "", empty="\"\"";
    
      /* It is important that tokens are recognized to a limited amount.
       * Otherwise, it would be too easy to fool the scanner with }}}
       * where they have no meaning.
       */
      char *theSameForStringConstants = "}}}";
      char *andCharConstants = '}}}';
    
      int main() { return yylex(); }
    }
    This code should be just copied
    (with a remark that the scanner recognized the C code a such.)
    
    Greetings, Scheff.
    

    Compiled and tested on cygwin64:

    $ flex --version
    flex 2.6.4
    
    $ flex -o scanC.cc scanC.l
    
    $ g++ --version
    g++ (GCC) 7.3.0
    
    $ g++ -std=c++11 -o scanC scanC.cc
    
    $ ./scanC < scanC.txt
    Hello juhist.
    
    The text without braces doesn't need to have any syntax.
    It just echoes the characters until it finds a block:
    
    Embedded C code:
    { // the start of C code
      // a C++ comment
      /* a C comment
       * (Remember that nested /*s are not supported.)
       */
      #define MAX 1024
      static char buffer[MAX] = "", empty="\"\"";
    
      /* It is important that tokens are recognized to a limited amount.
       * Otherwise, it would be too easy to fool the scanner with }}}
       * where they have no meaning.
       */
      char *theSameForStringConstants = "}}}";
      char *andCharConstants = '}}}';
    
      int main() { return yylex(); }
    
    }// End of embedded C code
    This code should be just copied
    (with a remark that the scanner recognized the C code a such.)
    
    Greetings, Scheff.
    $
    

    Notes:

    1. This is taken from a helper tool (not for sale). Hence, it is not bullet-proof, just good enough for our production code.

    2. What I noticed when adapting it: line continuation of preprocessor lines is not handled.

    3. It's surely possible to fool the tool with a creative combination of macros with unbalanced { }, something we would never do in our production code (see 1.).
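
    Regarding note 2, one possible way to cover line continuation is to treat a backslash directly followed by a line end as part of the directive. The following plain C++ sketch shows the idea outside of flex; directiveEnd is a hypothetical helper, and comments inside a directive are not considered.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Hypothetical helper: text[start] must be '#'. Returns the index of the
    // line end that terminates the directive (or text.size() at end of input).
    // A backslash directly followed by "\n", "\r", or "\r\n" continues the line.
    std::size_t directiveEnd(const std::string &text, std::size_t start)
    {
      assert(start < text.size() && text[start] == '#');
      const std::size_t n = text.size();
      std::size_t i = start;
      while (i < n) {
        if (text[i] == '\\' && i + 1 < n) {
          if (text[i + 1] == '\n') { i += 2; continue; }
          if (text[i + 1] == '\r') {
            i += (i + 2 < n && text[i + 2] == '\n') ? 3 : 2;
            continue;
          }
        }
        if (text[i] == '\n' || text[i] == '\r') break; // real end of directive
        ++i;
      }
      return i;
    }
    ```

    With this, a brace on the continuation line of a #define is skipped together with the rest of the directive instead of being counted.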

    So, it might be at least a start for further development.

    To check this against a C lex specification, I have the "ANSI C grammar, Lex specification" at hand, though it's 22 years old. (There are probably newer ones available that match the current standards.)