Grammar to negate two like characters in a lexer rule inside a single quoted string


ANTLR 4:

I need to support a single-quoted string literal with escaped characters AND the ability to use double curly braces as an 'escape sequence' that will need additional parsing. Both of the examples below need to be supported. I'm less worried about the second example, because it seems trivial once the first one works and the rule stops matching double curly braces.

1. 'this is a string literal with an escaped\' character'
2. 'this is a string {{functionName(x)}} literal with double curlies'

StringLiteral 
: '\'' (ESC | AnyExceptDblCurlies)*? '\'' ;

fragment 
ESC : '\\' [btnr\'\\];

fragment 
AnyExceptDblCurlies 
: '{' ~'{' 
| ~'{' .;

I've done a lot of research on this and understand that you can't negate a sequence of more than one character, and I have even seen a similar approach work in Bart's answer in this post:

Negating inside lexer- and parser rules

But what I'm seeing is that in example 1 above, the escaped single quote is not being recognized and I get a parser error that it cannot match ' character'.

If I alter the StringLiteral rule to the following, it works:

StringLiteral 
: '\'' (ESC | .)*? '\'' ;

Any ideas how to handle this scenario better? I can deduce that the escaped character is getting matched by AnyExceptDblCurlies instead of ESC, but I'm not sure how to solve this problem.


Solution

  • Parsing the template definition out of the string pretty much requires handling it in the parser. Use lexer modes to distinguish between string characters and the template name.

    Parser:

    parser grammar TesterParser ; // name assumed; it must match the grammar's file name

    options {
        tokenVocab = TesterLexer ;
    }
    
    test : string EOF ;
    string   : STRBEG ( SCHAR | template )* STREND ; // allow multiple templates per string
    template : TMPLBEG TMPLNAME TMPLEND ;
    

    Lexer:

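    lexer grammar TesterLexer ; // modes require a separate lexer grammar; name matches tokenVocab

    tokens { SCHAR } // SCHAR has no rule of its own, it is only assigned via type(SCHAR)

    STRBEG : Squote -> pushMode(strMode) ;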
    
    mode strMode ;
        STRESQ  : Esqote  -> type(SCHAR) ; // predeclare SCHAR in tokens block
        STREND  : Squote  -> popMode ;
        TMPLBEG : DBrOpen -> pushMode(tmplMode) ;
        STRCHAR : .       -> type(SCHAR) ;
    
    mode tmplMode ;
        TMPLEND  : DBrClose  -> popMode ;
        TMPLNAME : ~'}'+  ; // + instead of *, so the rule can never match an empty string
    
    fragment Squote : '\''   ;
    fragment Esqote : '\\\'' ;
    fragment DBrOpen   : '{{' ;
    fragment DBrClose  : '}}' ;
    

    Updated to correct the TMPLNAME rule, add main rule and options block.
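
To see which tokens the mode-based lexer above is meant to produce for the two example strings, the mode switching can be imitated by hand. The following is only a rough sketch in Python, not code generated by ANTLR: the function name `tokenize` and the `(type, text)` tuple shape are made up for illustration, and the branch order is an assumption about how the lexer resolves matches (longest match first, then the rule listed first).

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text); hypothetical shape


def tokenize(text: str) -> List[Token]:
    """Hand-rolled imitation of the strMode/tmplMode lexer (not ANTLR output)."""
    tokens: List[Token] = []
    modes = ["default"]  # mode stack, mirroring pushMode/popMode
    i = 0
    while i < len(text):
        mode = modes[-1]
        if mode == "default":
            if text[i] != "'":
                raise ValueError(f"unexpected {text[i]!r} at offset {i}")
            tokens.append(("STRBEG", "'"))
            modes.append("str")
            i += 1
        elif mode == "str":
            if text.startswith("\\'", i):    # STRESQ, retyped as SCHAR
                tokens.append(("SCHAR", "\\'"))
                i += 2
            elif text.startswith("{{", i):   # TMPLBEG enters template mode
                tokens.append(("TMPLBEG", "{{"))
                modes.append("tmpl")
                i += 2
            elif text[i] == "'":             # STREND leaves string mode
                tokens.append(("STREND", "'"))
                modes.pop()
                i += 1
            else:                            # STRCHAR, retyped as SCHAR
                tokens.append(("SCHAR", text[i]))
                i += 1
        else:  # template mode
            if text.startswith("}}", i):     # TMPLEND returns to string mode
                tokens.append(("TMPLEND", "}}"))
                modes.pop()
                i += 2
            else:                            # TMPLNAME: run of non-'}' characters
                j = i
                while j < len(text) and text[j] != "}":
                    j += 1
                if j == i:
                    raise ValueError(f"stray '}}' at offset {i}")
                tokens.append(("TMPLNAME", text[i:j]))
                i = j
    return tokens
```

Calling `tokenize` on the second example should yield a `TMPLBEG`, `TMPLNAME`, `TMPLEND` triple in the middle of the `SCHAR` run, which is the shape the `template` parser rule expects. On the first example the escaped quote comes out as a single two-character `SCHAR`, so it no longer ends the string.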