antlr antlr4

Reading all characters until the occurrence of ; that is not enclosed by ""


OK, I have the following problem:

I need to parse (or tokenize) the following text:

ASK "Hey dude, what's about \";\"" + "?";
ASK "How old are you?" INTO inAge;
ASK "This is a
multiline String with \";\";" + " can you parse this?"; ANSWER "Sure, i can!";

In the lexer, I tried it with modes:

ASK     : 'ASK' -> pushMode(UNTILSEMI) ;
ANSWER  : 'ANSWER' -> pushMode(UNTILSEMI) ;

mode UNTILSEMI;
ENDSEMI   : ';'+ -> popMode ;
CONTENT   : ~[;]+ ;

The parser would be:

askStmt: ASK CONTENT ENDSEMI;
answerStmt: ANSWER CONTENT ENDSEMI;

My problem: when there are semicolons inside "strings", the tokenizer stops at the first one and the parser won't work.

I have no idea how to start. Should I manipulate the lexer directly? Can I do this with lexer modes?


Solution

  • I don't see the need for lexical modes. Something like this would handle your example input correctly:

    parse
     : ( question | answer )* EOF
     ;
    
    question
     : ASK expression ( INTO ID )? SEMI
     ;
    
    answer
     : ANSWER expression SEMI
     ;
    
    expression
     : expression PLUS expression
     | STRING
     | ID
     ;
    
    ASK    : 'ASK';
    ANSWER : 'ANSWER';
    INTO   : 'INTO';
    ID     : [a-zA-Z]+;
    PLUS   : '+';
    SEMI   : ';';
    SPACES : [ \t\r\n]+ -> skip;
    STRING : '"' ( ~[\\"] | '\\' . )* '"';
    

    EDIT

    Even without expressions, so with only a few tokens, I don't see the need for lexical modes:

    parse
     : ( question | answer )* EOF
     ;
    
    question
     : ASK ~SEMI* SEMI OTHER*
     ;
    
    answer
     : ANSWER ~SEMI* SEMI OTHER*
     ;
    
    ASK    : 'ASK';
    ANSWER : 'ANSWER';
    SEMI   : ';';
    STRING : '"' ( ~[\\"] | '\\' . )* '"';
    OTHER  : ~[";];
    

    which will parse your example input as follows:

    [image: parse of the example input]
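
    If you would rather keep the lexer modes from the question, one possible fix is to let CONTENT consume a whole quoted string as a single unit, so that a ; inside quotes never ends the token. The following is only a sketch: the grammar name AskLexer, the WS rule and the QUOTED fragment are assumptions that do not appear in the question. Since ANTLR 4 only allows modes in lexer grammars, it is written as a separate lexer grammar:

    ```antlr
    lexer grammar AskLexer;   // hypothetical name

    ASK     : 'ASK'    -> pushMode(UNTILSEMI) ;
    ANSWER  : 'ANSWER' -> pushMode(UNTILSEMI) ;
    WS      : [ \t\r\n]+ -> skip ;   // assumption: whitespace between statements is ignored

    mode UNTILSEMI;
    ENDSEMI : ';'+ -> popMode ;
    // Everything up to the next ';' outside quotes: either a single character
    // that is neither ';' nor '"', or one complete quoted string.
    CONTENT : ( ~[;"] | QUOTED )+ ;
    // Same shape as the STRING rule above: '\\' . consumes an escaped character,
    // so \" does not close the string and \";\" stays inside it.
    fragment QUOTED : '"' ( ~[\\"] | '\\' . )* '"' ;
    ```

    With this, the parser rules from the question (askStmt: ASK CONTENT ENDSEMI;) should work as they are, provided the parser grammar picks up the tokens via options { tokenVocab=AskLexer; }. Note that CONTENT is then one opaque token, leading whitespace included, so something like INTO inAge would have to be picked apart afterwards, which is one reason to prefer the grammars above.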