String augmentation and concatenation in ANTLR


I am having trouble with augmented strings in ANTLR. For augmentedString to be parsed correctly, I have to keep string as a parser rule. But then every string node gets separate children for the opening quote, each piece of the body (if there is any) and the closing quote, which makes implementing a visitor harder. When I replace the string rule with

    string: STRING;
    STRING: QUOTE ( ESCAPE_SEQUENCE | .)*? QUOTE;

it breaks augmentedString, which is already "hackish". The grammar is below. Any suggestions would be helpful.

The test command used is:

    antlr4-parse strings.g4 root -gui <test1.txt

and test1.txt holds:

    string a = "normal string" + $"{id}";

grammar strings;

// Comments and white space
WS: [ \t\r\n]+ -> skip;

// key words
PLUS: '+';
// Symbols
QUESTION: '?';
LPAREN: '(';
RPAREN: ')';
LCURLY: '{';
RCURLY: '}';
SEMI: ';';
NEWLINE: '\n';
ASSIGN: '=';
QUOTE: '"';
DOLLAR: '$';

// Types
STRING_T: 'string';
type: STRING_T;

value:
    | augmentedString
    | concatanatedString
    | string;

// this has to be changed to a lexer rule in order not to get a child node for everything
string: 
QUOTE ( ESCAPE_SEQUENCE | .)*? QUOTE;
 
augmentedString:
    DOLLAR QUOTE (( ESCAPE_SEQUENCE | .)?( LCURLY expr RCURLY) | ( ESCAPE_SEQUENCE | .)( LCURLY expr RCURLY)?  ) * QUOTE;

concatanatedString: (id | augmentedString | string ) (PLUS (id | augmentedString | string))*;

ESCAPE_SEQUENCE:
    '\\' (('\\' | '\'' | '"' ) | UNICODE_ESCAPE);
fragment UNICODE_ESCAPE:
    'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT;
fragment HEX_DIGIT: [0-9a-fA-F];
// Identifiers
ID: [a-zA-Z_][a-zA-Z_0-9]*;
id: ID;
argument: type id; //maybe replace all with the right hand side.

field: type? id ASSIGN expr SEMI;

// Expressions
expr:
     id
    |value;
// Statements
statementList: statement*;

statement:
    field;


// Add a start rule for testing
root: (
    field
    )*;

I have tried changing the string rule to a parser rule with the extra step, as well as several hackish workarounds, such as splitting the string into a quote and a string body and having a separate end body.


Solution

  • You could make use of ANTLR's lexer modes. When you encounter a $", you switch to a special string mode and create tokens specific for an interpolated string. Inside that mode, you again switch to a "code mode" when you encounter a {. You pop back from this code mode after encountering a } and again pop back to the default mode when you encounter a ".

    To use modes, you must put the lexer and parser grammars in separate files, because modes are only allowed in lexer grammars. A quick demo:

    // File: ModeDemoLexer.g4
    lexer grammar ModeDemoLexer;
    
    PLUS           : '+';
    SIMPLE_STRING  : '"' ~["\r\n]* '"';
    STRING_START   : '$"' -> pushMode(STRING_MODE);
    SPACES         : S+ -> skip;
    
    fragment S     : [ \t\r\n];
    
    mode STRING_MODE;
    
    STRING_END     : '"' -> popMode;
    CODE_START     : '{' -> pushMode(CODE_MODE);
    STRING_ATOM    : ~["{];
    
    mode CODE_MODE;
    
    CODE_MODE_SKIP : S+ -> skip;
    ID             : [a-zA-Z_] [a-zA-Z_0-9]*;
    CODE_END       : '}' -> popMode;
    

    and:

    // File: ModeDemoParser.g4
    parser grammar ModeDemoParser;
    
    options {
      tokenVocab=ModeDemoLexer;
    }
    
    parse
     : expr EOF
     ;
    
    expr
     : expr PLUS expr
     | string
     | SIMPLE_STRING
     ;
    
    string
     : STRING_START string_atom* STRING_END
     ;
    
    string_atom
     : STRING_ATOM
     | CODE_START code_expr CODE_END
     ;
    
    code_expr
     : ID
     ;
    

    If you now parse "normal string" + $"id: { id }", you get the following result:

    [image: parse tree of the input above]
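To make the push/pop behaviour concrete, here is a small hand-written Python sketch that mimics the mode stack of ModeDemoLexer with regular expressions. It is not ANTLR's implementation and is not generated by ANTLR: the `RULES` table and the `tokenize` function are names made up for this illustration. One simplification to be aware of: where ANTLR reports a token recognition error and carries on, the sketch simply raises.

```python
import re

# Token rules per mode, mirroring ModeDemoLexer: (name, pattern, action).
# An action is None, "skip", "pop", or ("push", mode_name).
# This table is an illustration only, not something ANTLR produces.
RULES = {
    "DEFAULT": [
        ("PLUS", re.compile(r"\+"), None),
        ("SIMPLE_STRING", re.compile(r'"[^"\r\n]*"'), None),
        ("STRING_START", re.compile(r'\$"'), ("push", "STRING_MODE")),
        ("SPACES", re.compile(r"[ \t\r\n]+"), "skip"),
    ],
    "STRING_MODE": [
        ("STRING_END", re.compile(r'"'), "pop"),
        ("CODE_START", re.compile(r"\{"), ("push", "CODE_MODE")),
        ("STRING_ATOM", re.compile(r'[^"{]'), None),
    ],
    "CODE_MODE": [
        ("CODE_MODE_SKIP", re.compile(r"[ \t\r\n]+"), "skip"),
        ("ID", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*"), None),
        ("CODE_END", re.compile(r"\}"), "pop"),
    ],
}


def tokenize(text):
    """Return (token_name, lexeme) pairs using a stack of lexer modes."""
    modes = ["DEFAULT"]  # the top of the stack is the active mode
    pos = 0
    tokens = []
    while pos < len(text):
        best = None  # (name, lexeme, action) of the longest match so far
        for name, pattern, action in RULES[modes[-1]]:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group(), action)
        if best is None:
            raise ValueError(f"no rule matches {text[pos]!r} in mode {modes[-1]}")
        name, lexeme, action = best
        pos += len(lexeme)
        if action == "pop":
            modes.pop()
        elif isinstance(action, tuple):
            modes.append(action[1])
        if action != "skip":
            tokens.append((name, lexeme))
    return tokens


for name, lexeme in tokenize('"normal string" + $"id: { id }"'):
    print(name, repr(lexeme))
```

Note that every character of the interpolated text becomes its own STRING_ATOM token, which is why the parser rule uses string_atom*.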

    And without modes, you could do something like this:

    parse
     : expr EOF
     ;
    
    expr
     : expr PLUS expr
     | string
     | '(' expr ')'
     | SIMPLE_STRING
     | ID
     ;
    
    string
     : STRING_START expr ( STRING_MIDDLE expr )* STRING_END
     ;
    
    PLUS          : '+';
    SIMPLE_STRING : '"' ~["\r\n]* '"' | '$"' ~["\r\n{]* '"';
    STRING_START  : '$"' ~["{]* '{';
    STRING_MIDDLE : '}' ~["{]* '{';
    STRING_END    : '}' ~["{]* '"';
    SPACES        : [ \t\r\n]+ -> skip;
    ID            : [a-zA-Z_] [a-zA-Z_0-9]*;
    

    Parsing input "normal string" + $"id: { id }, x + y = { (x + y) }" would then result in:

    [image: parse tree of the input above]
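To see how the mode-free lexer rules carve up that input, the following Python sketch re-creates them as regular expressions and applies a longest-match rule, with ties going to the earlier rule. It is only an approximation written for this illustration: `NO_MODE_RULES` and `tokenize_no_modes` are made-up names, and `LPAREN`/`RPAREN` are hypothetical labels for the `'('` and `')'` literals used in the parser rule.

```python
import re

# Lexer rules of the mode-free grammar, translated to regexes by hand.
# LPAREN/RPAREN are hypothetical names for the '(' and ')' literals.
NO_MODE_RULES = [
    ("LPAREN", re.compile(r"\(")),
    ("RPAREN", re.compile(r"\)")),
    ("PLUS", re.compile(r"\+")),
    ("SIMPLE_STRING", re.compile(r'"[^"\r\n]*"|\$"[^"\r\n{]*"')),
    ("STRING_START", re.compile(r'\$"[^"{]*\{')),
    ("STRING_MIDDLE", re.compile(r'\}[^"{]*\{')),
    ("STRING_END", re.compile(r'\}[^"{]*"')),
    ("SPACES", re.compile(r"[ \t\r\n]+")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
]


def tokenize_no_modes(text):
    """Return (token_name, lexeme) pairs; SPACES tokens are skipped."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None  # (name, lexeme) of the longest match so far
        for name, pattern in NO_MODE_RULES:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches {text[pos]!r}")
        name, lexeme = best
        pos += len(lexeme)
        if name != "SPACES":
            tokens.append((name, lexeme))
    return tokens


source = '"normal string" + $"id: { id }, x + y = { (x + y) }"'
for name, lexeme in tokenize_no_modes(source):
    print(name, repr(lexeme))
```

Here the literal text between two embedded expressions ends up inside a single STRING_START, STRING_MIDDLE or STRING_END token together with its braces, so a visitor would have to strip the delimiters itself.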