
ANTLR4 grammar handling reserved keywords that appear inside "free-text" fields


I'm attempting to write an ANTLR4 grammar for LookML. The schema for this language is relatively straightforward, but it has two wrinkles: first, it supports a templating language that can be used in most "properties"; second, a few fields allow arbitrary SQL expressions.

My issue has to do with how the lexer gets tokens in different contexts.

LookML has a property for case/when expressions, for example:

dimension: query_type {
    type: string
    case: {
      when: {
        label: "SELECT Query"
        sql: ${name} ILIKE 'SELECT%'
          ;;
      }
      else: 'Other'
  }
}

So I have these tokens in my lexer:

CASE: 'case';
WHEN: 'when';
ELSE: 'else';

But you can also have CASE/WHEN statements in the arbitrary SQL fields:

dimension: full_name {
  type: string
  sql: case when true then 'its true' else 'its false' end ;;
}

My parser rule for the SQL expression can't just "catch all" of the text between sql: and ;; because it may contain template variables that I want parsed. When the lexer runs, it tokenizes the case and when inside the SQL as the reserved keywords meant for the case/when properties. My sql property rule then needs to account for CASE | WHEN | ELSE, and really for any reserved LookML keyword that could also find its way into arbitrary SQL code.

I've considered a few options:

  1. As I test and develop, add every possible keyword token to my sql property parser rule, and let the lexer keep emitting them as keyword tokens.
  2. Make the lexer treat everything between sql: and ;; as one big token, and handle parsing the possible template values in the application code.
  3. Make the tokens include the colon, so CASE becomes 'case:'.

Are any of these common approaches to this problem? This is my first grammar from scratch so I could be missing the point entirely here. I also tried looking into modes but I can't tell if that is actually the right application here.
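
For comparison, option 2 can be sketched outside the grammar. The following is a minimal Python sketch, not part of the original question: it assumes the raw text between sql: and ;; is available as one string and that template references always look like ${...}. The names SQL_BLOCK, TEMPLATE_REF and split_sql_block are made up for illustration.

```python
import re

# Assumption: a SQL block runs from "sql:" to the first ";;".
SQL_BLOCK = re.compile(r"sql\s*:(.*?);;", re.DOTALL)
# Assumption: template references have the form ${...}.
TEMPLATE_REF = re.compile(r"\$\{([^}]*)\}")


def split_sql_block(sql_text):
    """Split raw SQL into ('sql', text) and ('ref', name) pieces, in order."""
    pieces = []
    pos = 0
    for match in TEMPLATE_REF.finditer(sql_text):
        if match.start() > pos:
            # Plain SQL text before this template reference
            pieces.append(("sql", sql_text[pos:match.start()]))
        pieces.append(("ref", match.group(1).strip()))
        pos = match.end()
    if pos < len(sql_text):
        # Trailing SQL text after the last reference
        pieces.append(("sql", sql_text[pos:]))
    return pieces


source = """
dimension: query_type {
  sql: ${name} ILIKE 'SELECT%' ;;
}
"""

raw_sql = SQL_BLOCK.search(source).group(1).strip()
pieces = split_sql_block(raw_sql)
print(pieces)
```

Because the SQL is never tokenized by the LookML lexer, words such as case and when stay ordinary text. The trade-off is that the simple patterns here would be fooled by a ;; or a ${ that appears inside an SQL string literal.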


Solution

  • When parsing a language inside a language (SQL inside LookML), you could use lexical modes. Modes can only be declared in a lexer grammar, not in a combined grammar, so you'll need to separate the lexer and parser grammars.

    A quick demo:

    LookMLLexer.g4

    lexer grammar LookMLLexer;
    
    DIMENSION : 'dimension';
    // 'sql:' switches the lexer into SqlMode, where the LookML keywords below are not defined
    SQL : 'sql' SPACE* ':' -> pushMode(SqlMode);
    CASE : 'case';
    WHEN : 'when';
    ELSE : 'else';
    
    COL : ':';
    OBRACE :  '{';
    CBRACE : '}';
    
    STRING : '"' .*? '"';
    
    ID : [a-zA-Z_] [a-zA-Z_0-9]*;
    COMMENT : '#' ~[\r\n]* -> skip;
    SPACES : SPACE+ -> skip;
    
    OTHER : .;
    
    fragment SPACE : [ \t\r\n];
    
    mode SqlMode;
    
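    // ';;' ends the SQL block and returns to the default (LookML) mode
    SCOL2 : ';;' -> popMode;
    // Rule-level caseInsensitive option; requires ANTLR 4.10 or later
    SELECT options { caseInsensitive = true; } : 'select';
    FROM options { caseInsensitive = true; } : 'from';
    // Reuse the ID token type so the parser sees one identifier type in both modes
    SQL_ID : ID -> type(ID);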
    SQL_SPACES : SPACE+ -> skip;
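
To see why the mode switch removes the keyword clash, here is a small Python sketch that mimics the lexer above with a mode stack. It only illustrates the idea and is not how ANTLR is implemented: it covers a subset of the rules, picks the longest match with ties going to the earlier rule, and the names RULES and tokenize are hypothetical.

```python
import re

# (token name, regex, action) per mode. Actions: "push", "pop", "skip" or None.
RULES = {
    "DEFAULT": [
        ("SQL", r"sql[ \t\r\n]*:", "push"),
        ("CASE", r"case", None),
        ("WHEN", r"when", None),
        ("ELSE", r"else", None),
        ("COL", r":", None),
        ("OBRACE", r"\{", None),
        ("CBRACE", r"\}", None),
        ("STRING", r'"[^"]*"', None),
        ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*", None),
        ("SPACES", r"[ \t\r\n]+", "skip"),
    ],
    "SQL_MODE": [
        ("SCOL2", r";;", "pop"),
        ("SELECT", r"(?i:select)", None),
        ("FROM", r"(?i:from)", None),
        ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*", None),
        ("SPACES", r"[ \t\r\n]+", "skip"),
    ],
}
COMPILED = {
    mode: [(name, re.compile(rx), action) for name, rx, action in rules]
    for mode, rules in RULES.items()
}


def tokenize(text):
    """Return (name, text) tokens, using only the rules of the current mode."""
    modes = ["DEFAULT"]
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pattern, action in COMPILED[modes[-1]]:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m, action = best
        pos = m.end()
        if action == "skip":
            continue
        tokens.append((name, m.group()))
        if action == "push":
            modes.append("SQL_MODE")
        elif action == "pop":
            modes.pop()
    return tokens


tokens = tokenize('case: { when: { sql: case when ;; label: "x" } }')
for token in tokens:
    print(token)
```

In this run the first case and when come out as CASE and WHEN, while the case and when between sql: and ;; come out as plain ID tokens, because SQL_MODE has no CASE or WHEN rule.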
    

    LookMLParser.g4

    parser grammar LookMLParser;
    
    options {
      tokenVocab=LookMLLexer;
    }
    
    parse
     : dimension EOF
     ;
    
    dimension
     : DIMENSION ':' ID '{' case_when key_value* '}'
     ;
    
    case_when
     : CASE ':' '{' when+ else '}'
     ;
    
    when
     : WHEN ':' '{' sql key_value* '}'
     ;
    
    else
     : ELSE ':' value
     ;
    
    key_value
     : ID ':' value
     ;
    
    value
     : STRING
     | ID
     ;
    
    sql
     : SQL sql_stat SCOL2
     ;
    
    sql_stat
     : SELECT ID FROM ID
     ;
    

    The two grammars above will parse the input:

    dimension: field_name {
      case: {
        when: {
          sql: SELECT a FROM b ;;
          label: "value"
        }
        # Possibly more when statements
        else: "value"
      }
      alpha_sort:  yes
    }
    

    as follows:

    [image: parse tree of the input above]