programming-languages, string, language-design, c++11

How to implement C++0x raw string literal?


How can I define a working lexer and parser (e.g. with flex and bison) that support C++0x-style raw string literals?

As you may already know, new string literals in C++0x can be expressed in a very flexible way.

R"<delim>(...)<delim>"; - in this code, <delim> can be almost any short character sequence, and no escape characters are needed inside the literal.

The parentheses mark where the string begins and ends, so the text can itself contain quotes and parentheses:

R"(I love those who yearn for the impossible. (Von Goethe, "Faust"))";
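
For comparison, here is a small sketch of what "no escape characters" means in standard C++11. The variable names are made up for illustration. The first two strings hold the same text, and the third shows the one case where a non-empty delimiter is actually required.

```cpp
#include <string>

// The same text written twice: once as a raw literal, once with escapes.
const std::string raw_quote =
    R"(I love those who yearn for the impossible. (Von Goethe, "Faust"))";
const std::string escaped_quote =
    "I love those who yearn for the impossible. (Von Goethe, \"Faust\")";

// A custom delimiter (here: x) is only needed when the body itself
// contains the two-character sequence )" which would otherwise end it.
const std::string with_delim = R"x(call f("a)"))x";
```

With an empty delimiter the literal ends at the first `)"`, which is why the nested parentheses and quotes in the first string cause no trouble.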

Blocks of text can be delimited simply by repeating the same sequence of characters at both ends:

R";***************************(
  ; TINY BASIC FOR INTEL 8080  
  ;       VERSION 2.0  
  ;     BY LI-CHEN WANG  
  ; MODIFIED AND TRANSLATED  
  ;    TO INTEL MNEMONICS  
  ;     BY ROGER RAUSKOLB  
  ;     10 OCTOBER, 1976  
  ;       @COPYLEFT  
  ;  ALL WRONGS RESERVED      )
  ;***************************";

More information can be found here (Wikipedia) and here (att).

I would like to use this fantastic feature in a language I am developing now.

So, how can I define a proper tokenizer and syntax analyzer to achieve this?

Thanks in advance for your answers!


Solution

  • You could preprocess the literals in the lexical analysis stage and transform each one into something like a meta token.

    Input:  
        int a;  
        char *b = R"....";  
    
    Preprocessed:  
        int a;
        char *b = R*literal[0]*;
    
    Tokenized:  
        INT symbol[0] DELIM  
        CHAR OP_ASTR symbol[1] OP_EQ symbol[2] *literal[0]* DELIM  
    
    Symbol table contents { "a", "b", "R" }  
    
    Literal table contents { "...." }  
    

    literal[0] is a pointer to the original literal text, which is stored unmodified in the literal table.
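
The pre-pass described above could be sketched roughly as follows. This is only an illustration under some assumptions: `extract_raw_literals` and `Preprocessed` are hypothetical names, the literal form is taken to be the standard C++11 one (`R"delim(body)delim"`), and comments, character literals and encoding prefixes such as `u8R` are not handled. The function keeps the `R` in the text and replaces the quoted part with `*literal[N]*`, as in the example above, while the body goes into the literal table untouched.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Output of the pre-pass: rewritten source plus the literal table.
struct Preprocessed {
    std::string text;                   // source with *literal[N]* placeholders
    std::vector<std::string> literals;  // literals[N] holds the raw body verbatim
};

// Hypothetical helper (illustrative name, not from any library).
// Rewrites each R"delim(body)delim" as R*literal[N]* and records the body.
Preprocessed extract_raw_literals(const std::string& src) {
    Preprocessed out;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        // An R that ends an identifier (e.g. FOOR) does not start a raw literal.
        const bool after_ident =
            i > 0 && (std::isalnum(static_cast<unsigned char>(src[i - 1])) ||
                      src[i - 1] == '_');

        if (c == 'R' && !after_ident && i + 1 < src.size() && src[i + 1] == '"') {
            // Everything between R" and the first '(' is the delimiter.
            const std::size_t open = src.find('(', i + 2);
            if (open == std::string::npos)
                throw std::runtime_error("raw literal without '('");
            const std::string delim = src.substr(i + 2, open - (i + 2));
            // C++11 caps the delimiter at 16 characters and forbids spaces,
            // parentheses and backslashes; a custom language could relax this.
            if (delim.size() > 16 ||
                delim.find_first_of(" \\)\t\n\v\f") != std::string::npos)
                throw std::runtime_error("invalid raw literal delimiter");

            // The body ends at the first occurrence of )delim"
            const std::string closer = ")" + delim + "\"";
            const std::size_t close = src.find(closer, open + 1);
            if (close == std::string::npos)
                throw std::runtime_error("unterminated raw literal");

            out.text += "R*literal[" + std::to_string(out.literals.size()) + "]*";
            out.literals.push_back(src.substr(open + 1, close - (open + 1)));
            i = close + closer.size();
        } else if (c == '"') {
            // Ordinary string literal: copy it verbatim, skipping over
            // backslash escapes so that \" does not end it early.
            std::size_t j = i + 1;
            while (j < src.size() && src[j] != '"')
                j += (src[j] == '\\') ? 2 : 1;
            j = std::min(j + 1, src.size());
            out.text.append(src, i, j - i);
            i = j;
        } else {
            out.text += c;
            ++i;
        }
    }
    return out;
}
```

After this pass the regular flex/bison grammar never sees the raw text. It only needs one extra token rule for the `*literal[N]*` placeholder, whose index looks up the stored body in `literals`.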