How to parse Velocity variables using ANTLR4


Velocity variables use the following notation (see the Velocity User Guide):

The shorthand notation of a variable consists of a leading "$" character followed by a VTL Identifier. A VTL Identifier must start with an alphabetic character (a .. z or A .. Z). The rest of the characters are limited to the following types of characters:

  • alphabetic (a .. z, A .. Z)
  • numeric (0 .. 9)
  • underscore ("_")
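
As a rough cross-check of the quoted rule outside of any grammar, it can be written as a plain regular expression. This is only a sketch: the class and method names are hypothetical, and the pattern covers just the three character classes listed above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Velocity or ANTLR): checks the quoted
// shorthand notation with a plain regular expression.
final class VtlShorthand {

  // "$", then one letter, then any number of letters, digits, or underscores.
  private static final Pattern SHORTHAND =
      Pattern.compile("\\$[a-zA-Z][a-zA-Z0-9_]*");

  static boolean isShorthandVariable(String s) {
    return SHORTHAND.matcher(s).matches();
  }

  public static void main(String[] args) {
    System.out.println(isShorthandVariable("$foo"));      // true
    System.out.println(isShorthandVariable("$mud_9"));    // true
    System.out.println(isShorthandVariable("$9lives"));   // false: starts with a digit
    System.out.println(isShorthandVariable("foo"));       // false: no leading "$"
  }
}
```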

I want to use lexer modes to separate the normal text from the variables, so I wrote something like this:

// default mode
DOLLAR : '$' -> pushMode(VARIABLE);
TEXT : ~[$]+? -> skip;

mode VARIABLE;
ID : [a-zA-Z] [a-zA-Z0-9-_]*;
???? : XXX -> popMode;   // how can I pop mode to default?

Because the variable notation has no explicit end character, I don't know how to determine where a variable ends.

Maybe I got it wrong?


Solution

  • You can pop out of that mode by attaching popMode to the ID rule itself:

    mode VARIABLE;
      ID  : [a-zA-Z] [a-zA-Z0-9\-_]* -> popMode;
    

    Here's a quick demo:

    lexer grammar VelocityLexer;
    
    DOLLAR : '$' -> more, pushMode(VARIABLE);
    TEXT   : ~[$]+ -> skip;
    
    mode VARIABLE;
      // the `-` needs to be escaped!
      ID : [a-zA-Z] [a-zA-Z0-9\-_]* -> popMode;
    

    Note the more command in the DOLLAR rule: it causes the $ to be included in the ID token. Without it, you end up with two tokens ($ and foo for the input $foo).

    Test the grammar with the following Java class:

    import org.antlr.v4.runtime.*;
    
    public class Main {
    
      public static void main(String[] args) {
    
        VelocityLexer lexer = new VelocityLexer(CharStreams.fromString("<strong>$Mu</strong>$foo..."));
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        tokenStream.fill();
    
        for (Token t : tokenStream.getTokens()) {
          System.out.printf("%-10s '%s'\n", VelocityLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
      }
    }
    

    which will print:

    ID         '$Mu'
    ID         '$foo'
    EOF        '<EOF>'
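
    The reason no explicit end character is needed is that a lexer rule consumes the longest run of characters it can match, so ID simply stops at the first character that cannot continue it. The sketch below mimics that behaviour by hand. It is not what ANTLR generates, the class and method names are hypothetical, and it quietly drops a "$" that is not followed by a letter, which the generated lexer handles differently.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical hand-written scanner that imitates the demo grammar:
    // TEXT -> skip, DOLLAR -> more + pushMode, ID -> popMode.
    final class ModeSketch {

      static boolean isIdStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
      }

      static boolean isIdPart(char c) {
        return isIdStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '_';
      }

      static List<String> scan(String input) {
        List<String> ids = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
          if (input.charAt(i) != '$') {
            i++;                       // default mode: plain text is skipped
            continue;
          }
          int start = i++;             // "$" is kept (like more) and we enter VARIABLE
          if (i < input.length() && isIdStart(input.charAt(i))) {
            i++;
            while (i < input.length() && isIdPart(input.charAt(i))) {
              i++;                     // keep going while the ID can be extended
            }
            ids.add(input.substring(start, i));  // emit ID, back to default mode
          }
          // Simplification: a "$" without a following letter is ignored here.
        }
        return ids;
      }

      public static void main(String[] args) {
        System.out.println(scan("<strong>$Mu</strong>$foo..."));  // [$Mu, $foo]
      }
    }
    ```

    The first "<" after $Mu and the first "." after $foo are what end each variable, without any dedicated terminator rule.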
    

    However, I don't think a lexical mode is a good choice for something as simple as an ID. Why not simply do:

    lexer grammar VelocityLexer;
    
    DOLLAR : '$' [a-zA-Z] [a-zA-Z0-9\-_]*;
    TEXT   : ~[$]+ -> skip;
    

    ?