
ANTLR4 - parse function-like structures in regular text


I'm experimenting with a grammar that should match function-like structures inside regular text. These functions start with a dollar sign, accept text arguments surrounded by apostrophes, and allow nesting of other functions.

I was able to achieve this under more constrained conditions, where every piece of text has to be surrounded by apostrophes and pieces are concatenated with the '+' character, but I want to redesign it to work without this constraint.

I came up with this grammar:

grammar Functions;

fragment DIGIT : [0-9];
fragment LETTER : [A-Za-z];

FUNCTION_NAME : '$' LETTER (LETTER | DIGIT)+;

APOSTROPHE : '\'';
LEFT_PARENTHESIS  : '(';
RIGHT_PARENTHESIS : ')';

ESCAPE_CHARACTER: '\\' [$()\\'];
TEXT  : '\'' ~[\r\n']* '\'';

PLAIN_TEXT : . -> skip;

start : subString*;

subString: function
   | ESCAPE_CHARACTER
   | LEFT_PARENTHESIS 
   | RIGHT_PARENTHESIS
   | APOSTROPHE
   | TEXT
   ;

function
    : FUNCTION_NAME LEFT_PARENTHESIS param? RIGHT_PARENTHESIS
    ;

param
    : function
    | TEXT
    ;

But the following example does not work well:

Text $func('A') 'text $func2()'

The problem is that 'text $func2()' is matched as a single TEXT token. I therefore added an escaping feature, so writing \' instead of ' solves the problem.
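
To illustrate why the quoted segment swallows the function call, here is a rough Python simulation of the lexer rules above. It is only a sketch, not ANTLR itself: it assumes the usual lexer behaviour of picking the rule with the longest match at the current position (with ties going to the rule listed first), and the regular expressions and the name `tokenize_longest_match` are hand-written stand-ins for the grammar, not anything ANTLR generates.

```python
import re

# Hand-translated approximations of the lexer rules from the grammar above,
# in the same order. PLAIN_TEXT is the catch-all rule that is skipped.
RULES = [
    ("FUNCTION_NAME", re.compile(r"\$[A-Za-z][A-Za-z0-9]+")),
    ("APOSTROPHE", re.compile(r"'")),
    ("LEFT_PARENTHESIS", re.compile(r"\(")),
    ("RIGHT_PARENTHESIS", re.compile(r"\)")),
    ("ESCAPE_CHARACTER", re.compile(r"\\[$()\\']")),
    ("TEXT", re.compile(r"'[^\r\n']*'")),
    ("PLAIN_TEXT", re.compile(r"[\s\S]")),
]


def tokenize_longest_match(text):
    """Return (rule name, lexeme) pairs, dropping skipped PLAIN_TEXT."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins, so the earlier rule is kept on a tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        lexeme = text[pos:best_end]
        pos = best_end  # PLAIN_TEXT always matches, so pos always advances
        if best_name != "PLAIN_TEXT":
            tokens.append((best_name, lexeme))
    return tokens


for token in tokenize_longest_match("Text $func('A') 'text $func2()'"):
    print(token)
```

In this simulation, at the second apostrophe the TEXT pattern matches the whole of 'text $func2()' while APOSTROPHE matches only one character, so TEXT wins and $func2 never becomes a FUNCTION_NAME token. With the apostrophes escaped as \', the two-character ESCAPE_CHARACTER match is chosen instead and $func2 is tokenized on its own.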

However, I'd like characters outside a function context to be treated as regular characters. Because of this 'context' I'm starting to think that I've reached the limitations of context-free grammars, but I don't have enough practical experience to confirm that.

Is it possible to meet my requirements using ANTLR4?


Solution

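  • This could work, using lexer modes so that parentheses and quoted parameters are only tokenized inside a function call: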

    FunctionsLexer.g4

    lexer grammar FunctionsLexer;
    
    FUNCTION_NAME : '$' LETTER (LETTER | DIGIT)* -> pushMode(InFunction);
    PLAIN_TEXT : . -> skip;
    
    mode InFunction;
    
    FUNCTION_NAME_NESTED
     : '$' LETTER (LETTER | DIGIT)* -> type(FUNCTION_NAME), pushMode(InFunction)
     ;
    
    PARAM : '\'' ~['$]* '\'';
    LEFT_PARENTHESIS  : '(';
    RIGHT_PARENTHESIS : ')' -> popMode;
    
    fragment DIGIT : [0-9];
    fragment LETTER : [A-Za-z];
    

    FunctionsParser.g4

    parser grammar FunctionsParser;
    
    options {
      tokenVocab=FunctionsLexer;
    }
    
    start
     : subString* EOF
     ;
    
    subString
     : function
     ;
    
    function
     : FUNCTION_NAME LEFT_PARENTHESIS param? RIGHT_PARENTHESIS
     ;
    
    param
     : function
     | PARAM
     ;
    

    The input Text $func('A') 'text $func2()' BLA $fun3($fun4('...')) produces 15 tokens:

      1    FUNCTION_NAME                  '$func'
      2    LEFT_PARENTHESIS               '('
      3    PARAM                          '\'A\''
      4    RIGHT_PARENTHESIS              ')'
      5    FUNCTION_NAME                  '$func2'
      6    LEFT_PARENTHESIS               '('
      7    RIGHT_PARENTHESIS              ')'
      8    FUNCTION_NAME                  '$fun3'
      9    LEFT_PARENTHESIS               '('
      10   FUNCTION_NAME                  '$fun4'
      11   LEFT_PARENTHESIS               '('
      12   PARAM                          '\'...\''
      13   RIGHT_PARENTHESIS              ')'
      14   RIGHT_PARENTHESIS              ')'
      15   EOF                            '<EOF>'
    

    and the start rule produces the following parse tree:

    '- start
       |- subString
       |  '- function
       |     |- '$func' (FUNCTION_NAME)
       |     |- '(' (LEFT_PARENTHESIS)
       |     |- param
       |     |  '- '\'A\'' (PARAM)
       |     '- ')' (RIGHT_PARENTHESIS)
       |- subString
       |  '- function
       |     |- '$func2' (FUNCTION_NAME)
       |     |- '(' (LEFT_PARENTHESIS)
       |     '- ')' (RIGHT_PARENTHESIS)
       |- subString
       |  '- function
       |     |- '$fun3' (FUNCTION_NAME)
       |     |- '(' (LEFT_PARENTHESIS)
       |     |- param
       |     |  '- function
       |     |     |- '$fun4' (FUNCTION_NAME)
       |     |     |- '(' (LEFT_PARENTHESIS)
       |     |     |- param
       |     |     |  '- '\'...\'' (PARAM)
       |     |     '- ')' (RIGHT_PARENTHESIS)
       |     '- ')' (RIGHT_PARENTHESIS)
       '- '<EOF>'
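
For readers who want to see the mode switching step by step, the following is a small Python sketch that mimics the lexer grammar above with an explicit mode stack. It is a hand-written approximation under stated assumptions, not code generated by ANTLR: the regular expressions, the action labels ("push", "pop", "skip") and the name `tokenize_with_modes` are illustrative only, and where ANTLR would report a token recognition error the sketch simply raises a ValueError.

```python
import re

NAME = r"\$[A-Za-z][A-Za-z0-9]*"

# Default mode: only function names are tokens, everything else is skipped.
DEFAULT_MODE = [
    ("FUNCTION_NAME", re.compile(NAME), "push"),
    ("PLAIN_TEXT", re.compile(r"[\s\S]"), "skip"),
]

# InFunction mode: nested names, quoted parameters and parentheses.
IN_FUNCTION = [
    ("FUNCTION_NAME", re.compile(NAME), "push"),
    ("PARAM", re.compile(r"'[^'$]*'"), None),
    ("LEFT_PARENTHESIS", re.compile(r"\("), None),
    ("RIGHT_PARENTHESIS", re.compile(r"\)"), "pop"),
]


def tokenize_with_modes(text):
    """Return (token type, lexeme) pairs; the EOF token is not included."""
    modes = [DEFAULT_MODE]  # mode stack; the last entry is the active mode
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern, action in modes[-1]:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[0].end()):
                best = (m, name, action)
        if best is None:
            raise ValueError(f"no rule matches {text[pos]!r} at index {pos}")
        m, name, action = best
        pos = m.end()
        if action == "skip":
            continue
        tokens.append((name, m.group()))
        if action == "push":
            modes.append(IN_FUNCTION)  # entering a (possibly nested) call
        elif action == "pop":
            modes.pop()  # the closing parenthesis leaves the current call
    return tokens


sample = "Text $func('A') 'text $func2()' BLA $fun3($fun4('...'))"
for token_type, lexeme in tokenize_with_modes(sample):
    print(token_type, lexeme)
```

On the sample input the sketch yields the same 14 tokens as the listing above, minus EOF. The apostrophes around text $func2() are read in the default mode, where they are skipped like any other plain character, which is why $func2 is still recognized. One consequence of this design, at least in the sketch, is that any character not covered by a rule inside a call (for example a space between the name and the parenthesis) is an error rather than plain text.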