
How to understand the CHAR_LITERAL rule in the Java ANTLR4 lexer file


From the Java ANTLR4 lexer grammar:

CHAR_LITERAL:       '\'' (~['\\\r\n] | EscapeSequence) '\'';

fragment EscapeSequence
    : '\\' 'u005c'? [btnfr"'\\]
    | '\\' 'u005c'? ([0-3]? [0-7])? [0-7]
    | '\\' 'u'+ HexDigit HexDigit HexDigit HexDigit
    ;

Why are '\r', '\n', '\'' and '\\' excluded in the first part, while '\b', '\t', '\f' and '"' are not?

If I change the rule to this, is it equivalent to the previous rule?

CHAR_LITERAL:       '\'' (~['\\\r\n\b\f\t\"] | EscapeSequence) '\'';

Or if I change it to this?

CHAR_LITERAL:       '\'' (~['\\] | EscapeSequence) '\'';

Solution

  • It's not trying to exclude something like:

    char x = '\r';
    

    It's trying to exclude:

    char x = '
    ';
    

    That last one is illegal Java. The ' that opens a char literal can be followed by either an EscapeSequence or a single character, with one exception: that character cannot be a raw line terminator (as in, literally pressing Enter in your editor). The two-character text \n is not a line terminator; it is an escape sequence that represents one.

    In other words, after the opening single quote, any character is fine EXCEPT another single quote, EXCEPT a backslash (which is excluded because EscapeSequence handles it), and EXCEPT the literal Unicode values 0D and 0A (CR and LF; in ANTLR-speak, \r and \n).

    This can get a little confusing, so count the backslashes very carefully:

    ['\\\r\n]
    

    That set excludes 4 Unicode values, and only 4:

    • a single quote. char x = ''; is not legal Java.
    • a backslash, because the grammar has to hand over to the EscapeSequence alternative to parse it. char x = '\'; is not legal.
    • CR and LF, the two line-terminator characters (this bullet counts for two of the four), because you can't actually start a new line in the middle of a char literal.

    In contrast, EscapeSequence is not looking for a line feed character; it is looking for a backslash character followed by the actual letter n.
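
To check the two proposed variants concretely, here is a minimal sketch that approximates each version of CHAR_LITERAL with java.util.regex and runs a few inputs through all three. The class name CharLiteralSketch, the helper names, and the regex translation itself are assumptions made for illustration (including reading HexDigit as [0-9a-fA-F]). This is not how ANTLR matches tokens internally; it is only a way to compare the excluded character sets.

```java
import java.util.regex.Pattern;

class CharLiteralSketch {
    // Regex approximation of the EscapeSequence fragment, one line per alternative.
    static final String ESCAPE =
          "\\\\(?:u005c)?[btnfr\"'\\\\]"          // backslash, then one of b t n f r " ' or backslash
        + "|\\\\(?:u005c)?(?:[0-3]?[0-7])?[0-7]"  // backslash, then one to three octal digits
        + "|\\\\u+[0-9a-fA-F]{4}";                // backslash, one or more u, four hex digits (assumed HexDigit)

    // Builds: quote, then (one character outside the excluded set OR an escape), then quote.
    static Pattern rule(String excluded) {
        return Pattern.compile("'(?:[^" + excluded + "]|" + ESCAPE + ")'");
    }

    // Original set: single quote, backslash, CR, LF.
    static final Pattern ORIGINAL = rule("'\\\\\\r\\n");
    // First proposed variant: additionally excludes backspace, form feed, tab, double quote.
    static final Pattern VARIANT_1 = rule("'\\\\\\r\\n\\x08\\f\\t\"");
    // Second proposed variant: excludes only single quote and backslash.
    static final Pattern VARIANT_2 = rule("'\\\\");

    static boolean accepts(Pattern p, String text) {
        return p.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "'a'",        // plain character
            "'\"'",       // double quote between single quotes
            "'\t'",       // raw tab character between the quotes
            "'\n'",       // raw line feed between the quotes
            "'\\n'",      // backslash followed by the letter n
            "'\\101'",    // octal escape
            "'\\u0041'",  // unicode escape
            "''",         // empty literal
            "'\\'"        // lone backslash
        };
        for (String s : inputs) {
            String shown = s.replace("\t", "<TAB>").replace("\n", "<LF>");
            System.out.printf("%-10s original=%b variant1=%b variant2=%b%n",
                shown, accepts(ORIGINAL, s), accepts(VARIANT_1, s), accepts(VARIANT_2, s));
        }
    }
}
```

Under this approximation neither variant is equivalent to the original. The first one rejects '"' and a char literal containing a raw tab, both of which the original rule accepts and both of which are legal Java. The second one accepts a raw line feed between the quotes, which the original rejects. All three agree on the escape-sequence inputs, because those go through EscapeSequence rather than the excluded set.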