antlr4, language-design

Antlr4: Skip line when it starts with * unless the second char is


In my input, a line that starts with * is a comment line unless it starts with *+ or *-. I can ignore the comments, but I need to capture the other lines.

These are my lexer rules:

WhiteSpaces : [ \t]+;
Newlines    : [\r\n]+;
Comment     : '*' .*? Newlines -> skip ;
SkipTokens  : (WhiteSpaces | Newlines) -> skip;

An example:

* this is a comment line
** another comment line
*+ type value

So the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.


Solution

  • Your SkipTokens lexer rule will never be matched, because the rules WhiteSpaces and Newlines are placed before it: when several rules match the same input, the lexer picks the one defined first. See this Q&A for an explanation of how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule

    For it to work as you expect, do this:

    SkipTokens  : (WhiteSpaces | Newlines) -> skip;
    
    fragment WhiteSpaces : [ \t]+;
    fragment Newlines    : [\r\n]+;
    

    For an explanation of what a fragment is, see this Q&A: What does "fragment" mean in ANTLR?

    Now, for your question. You defined a Comment rule that always ends with a line break. This means that a comment can't be the last thing in your input. So a comment should be allowed to end with either a line break or the EOF.

    Something like this should do the trick:

    COMMENT
     : '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
     | '*' ( [\r\n]+ | EOF )   // a '*' is a valid comment if directly followed by a line break, or the EOF
     ;
    
    STAR_MINUS
     : '*-'
     ;
    
    STAR_PLUS
     : '*+'
     ;
    
    SPACES
     : [ \t\r\n]+ -> skip
     ;
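
    To show how the `*+` and `*-` lines could then be caught in the parser, here is a minimal sketch of a complete grammar. The grammar name StarLines, the parser rules file and directive, and the WORD token are hypothetical and not part of your grammar. The sketch also assumes you want comments dropped, so it adds -> skip to COMMENT:

    grammar StarLines;   // hypothetical grammar name

    // Parser rules (hypothetical): a file is a sequence of "*+ ..." / "*- ..." lines
    file
     : directive* EOF
     ;

    directive
     : ( STAR_PLUS | STAR_MINUS ) WORD*
     ;

    // Lexer rules
    COMMENT
     : ( '*' ~[+\-\r\n] ~[\r\n]*   // '*' followed by anything except '+', '-' or a line break
       | '*' ( [\r\n]+ | EOF )     // a lone '*' followed by a line break or the EOF
       ) -> skip                   // assumption: comments are discarded
     ;

    STAR_MINUS : '*-' ;
    STAR_PLUS  : '*+' ;

    // Assumption: operands are runs of characters that are neither '*' nor whitespace
    WORD       : ~[* \t\r\n]+ ;

    SPACES     : [ \t\r\n]+ -> skip ;

    With your example input, the first two lines are consumed by COMMENT, and the third should produce the tokens STAR_PLUS, WORD, WORD, which directive matches. Because line breaks are skipped, a directive ends where the next STAR_PLUS or STAR_MINUS (or the EOF) begins.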
    

    This, of course, does not require the * to be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
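
    One possible way to enforce the start-of-line requirement is a semantic predicate in the lexer. This is only a sketch, assuming the Java target, where getCharPositionInLine() is available on the lexer; other targets need the equivalent call in their own syntax:

    COMMENT
     : {getCharPositionInLine() == 0}?   // assumption: Java target; only match in column 0
       ( '*' ~[+\-\r\n] ~[\r\n]*
       | '*' ( [\r\n]+ | EOF )
       ) -> skip
     ;

    With this predicate, a * that is not in the first column no longer starts a comment, so you would need another rule to tokenize it, or the lexer will report a token recognition error. Leading spaces also count: a * preceded by whitespace is not in column 0.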