
antlr grammar: Allow whitespace matching only in template string


I want to parse template strings:

`Some text ${variable.name} and so on ... ${otherVariable.function(parameter)} ...`

Here is my grammar:

varname: VAR ;
variable: varname funParameter? ('.' variable)* ;
templateString: '`' (TemplateStringLiteral* '${' variable '}' TemplateStringLiteral*)+ '`' ;
funParameter: '(' variable? (',' variable)*  ')' ;

WS      : [ \t\r\n\u000C]+ -> skip ;
TemplateStringLiteral: ('\\`' | ~'`') ;
VAR : [$]?[a-zA-Z0-9_]+|[$] ;

When the input is parsed, the template string no longer contains any whitespace because of the WS -> skip rule. When I put TemplateStringLiteral before WS, I get the error:

extraneous input ' ' expecting {'`'}

How can I have whitespace kept (not skipped) only inside the template string?


Solution

    What is currently happening

    Running your example through the lexer generated from your current grammar and printing the tokens gives this:

    [@0,0:0='`',<'`'>,1:0]
    [@1,1:4='Some',<VAR>,1:1]
    [@2,6:9='text',<VAR>,1:6]
    [@3,11:12='${',<'${'>,1:11]
    [@4,13:20='variable',<VAR>,1:13]
    [@5,21:21='.',<'.'>,1:21]
    [@6,22:25='name',<VAR>,1:22]
    [@7,26:26='}',<'}'>,1:26]
    ... shortened ...
    [@26,85:84='<EOF>',<EOF>,2:0]
    

    This tells you that Some, which you intended to be matched by TemplateStringLiteral*, was actually lexed as VAR. Why is this happening?

    As mentioned in this answer, ANTLR uses the longest possible match to create a token. Since your TemplateStringLiteral rule matches only a single character, while your VAR rule matches arbitrarily many, the lexer uses the latter to match Some. For a single space, WS and TemplateStringLiteral both match one character, and the tie goes to the rule defined first, which is why the order of those two rules changes the result.
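
    To make the rule concrete, here is a minimal Python sketch that imitates it: at each position every rule is tried, the longest match wins, and a tie goes to the rule listed first. It is a simulation only, not ANTLR itself. The names RULES and tokenize are made up for this illustration, and the sketch assumes that the literals used in the parser rules ('`', '${', '.', ...) take precedence over the explicit lexer rules.

```python
import re

# Rules in priority order. Assumption: literals from the parser rules come
# before the explicit lexer rules WS, TemplateStringLiteral and VAR.
RULES = [(name, re.compile(pattern)) for name, pattern in [
    ("'`'", r"`"),
    ("'${'", r"\$\{"),
    ("'}'", r"\}"),
    ("'.'", r"\."),
    ("'('", r"\("),
    ("')'", r"\)"),
    ("','", r","),
    ("WS", r"[ \t\r\n\f]+"),
    ("TemplateStringLiteral", r"\\`|[^`]"),
    ("VAR", r"\$?[a-zA-Z0-9_]+|\$"),
]]


def tokenize(text):
    """Longest match wins; on equal length the earlier rule is kept."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in RULES:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so earlier rules win ties.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # imitate "-> skip"
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for token in tokenize("`Some text ${variable.name}`"):
        print(token)
```

    In this sketch Some and text come out as VAR because VAR matches four characters where TemplateStringLiteral matches one, and the spaces disappear because WS is listed before TemplateStringLiteral.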

    What you could try (Spoiler: won't work)

    You could try to modify the rule like this:

    TemplateStringLiteral: ('\\`' | ~'`')+ ;
    

    so that it captures more than one character and is therefore preferred. This does not work, for two reasons:

    1. How would the lexer match anything to the VAR rule, ever?

    2. The TemplateStringLiteral rule now also matches ${, which prevents the start of a template chunk from being recognized correctly.
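
    Both problems can be seen with ordinary regular expressions. The following Python sketch only compares match lengths and does not involve ANTLR. GREEDY_LITERAL and VAR are hand-written regex equivalents of the two rules, and match_lengths is a helper invented for this illustration.

```python
import re

# Regex equivalents of the two lexer rules (assumed translations).
GREEDY_LITERAL = re.compile(r"(?:\\`|[^`])+")   # ('\\`' | ~'`')+
VAR = re.compile(r"\$?[a-zA-Z0-9_]+|\$")        # [$]?[a-zA-Z0-9_]+ | [$]


def match_lengths(text, pos=0):
    """Return (length matched by the greedy literal, length matched by VAR)."""
    lit = GREEDY_LITERAL.match(text, pos)
    var = VAR.match(text, pos)
    return (lit.end() - pos if lit else 0, var.end() - pos if var else 0)


if __name__ == "__main__":
    body = "Some text ${variable.name} and so on"
    literal_len, var_len = match_lengths(body + "`")
    # The greedy literal runs up to the closing backtick, swallowing "${"
    # and the variable on the way; VAR only ever gets as far as "Some".
    print(literal_len == len(body), var_len)
```

    Since the greedy rule always matches at least as much as VAR does, VAR can never win a longest-match comparison against it.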

    How to achieve what you actually want

    There might be another solution, but this one works:

    File MartinCup.g4:

    parser grammar MartinCup;
    
    options { tokenVocab=MartinCupLexer; }
    
    templateString
        : BackTick TemplateStringLiteral* (template TemplateStringLiteral*)+ BackTick
        ;
    
    template
        : TemplateStart variable TemplateEnd
        ;
    
    variable
        : varname funParameter? (Dot variable)*
        ;
    
    varname
        : VAR
        ;
    
    funParameter
        : OpenPar variable? (Comma variable)* ClosedPar
        ;
    

    File MartinCupLexer.g4:

    lexer grammar MartinCupLexer;
    
    BackTick : '`' ;
    
    TemplateStart
        : '${' -> pushMode(templateMode)
        ;
    
    TemplateStringLiteral
        : '\\`'
        | ~'`'
        ;
    
    mode templateMode;
    
    VAR
        : [$]?[a-zA-Z0-9_]+
        | [$]
        ;
    
    OpenPar : '(' ;
    ClosedPar : ')' ;
    Comma : ',' ;
    Dot : '.' ;
    
    TemplateEnd
        : '}' -> popMode
        ;
    

    This grammar uses lexer modes to differentiate between the inside and the outside of the curly braces. The VAR rule is now only active after ${ has been encountered and only stays active until } is read. As a result, it no longer captures non-template text like Some.
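
    The mode switching can be pictured as a stack of rule sets. The Python sketch below imitates the lexer grammar above under that assumption: each mode has its own rule list, ${ pushes templateMode and } pops it. It is an illustration only, not how ANTLR is implemented. MODES and tokenize are hypothetical names, and unlike ANTLR, which reports a token recognition error and carries on, this sketch simply raises an exception.

```python
import re

# One rule list per mode: (token name, pattern, mode action).
MODES = {
    "DEFAULT": [
        ("BackTick", re.compile(r"`"), None),
        ("TemplateStart", re.compile(r"\$\{"), "push"),
        ("TemplateStringLiteral", re.compile(r"\\`|[^`]"), None),
    ],
    "templateMode": [
        ("VAR", re.compile(r"\$?[a-zA-Z0-9_]+|\$"), None),
        ("OpenPar", re.compile(r"\("), None),
        ("ClosedPar", re.compile(r"\)"), None),
        ("Comma", re.compile(r","), None),
        ("Dot", re.compile(r"\."), None),
        ("TemplateEnd", re.compile(r"\}"), "pop"),
    ],
}


def tokenize(text):
    """Tokenize using only the rules of the mode on top of the stack."""
    stack = ["DEFAULT"]
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, regex, action in MODES[stack[-1]]:
            m = regex.match(text, pos)
            # Longest match wins; earlier rules win ties.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            raise ValueError(f"no rule in mode {stack[-1]} matches {text[pos]!r}")
        name, end, action = best
        tokens.append((name, text[pos:end]))
        if action == "push":
            stack.append("templateMode")
        elif action == "pop":
            stack.pop()
        pos = end
    return tokens


if __name__ == "__main__":
    for token in tokenize("`a b${x.y}`"):
        print(token)
```

    Outside the braces every character, including the space, becomes a TemplateStringLiteral, because VAR is not part of the default mode. Inside the braces a space matches no rule at all, which corresponds to whitespace not being allowed within a template.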

    Note that lexer modes require a split grammar (separate files for the parser and lexer grammars). Since no lexer rules are allowed in a parser grammar, I had to introduce tokens for the parentheses, comma, dot and backticks.

    About the whitespace

    I assume you want to keep whitespace inside the "normal text" but not allow it inside the templates, so I simply removed the WS rule. If you would rather have whitespace inside the templates ignored, you can re-add the rule within templateMode.

    I tested your alternative grammar, where you put TemplateStringLiteral above WS, but contrary to your observation, this gives me:

    line 1:1 extraneous input 'Some' expecting {'${', TemplateStringLiteral}

    The reason for this is the same as above: Some is lexed as VAR.