I want to define a lexer rule for a range of Unicode characters whose code points need more than four hexadecimal digits. Concretely, I want to declare the following rule:
ID_Continue : [\uE0100-\uE01EF] ;
Unfortunately, it doesn't work: the rule matches characters outside this range. (I'm not sure exactly what behaviour results, but it isn't what I want.) I've also tried padding with leading zeros and using eight digits:
ID_Continue : [\U000E0100-\U000E01EF] ;
But it seems to result in the same unwanted behaviour.
I am using ANTLR4 and its IntelliJ plugin for testing.
Does ANTLR4 not support Unicode literals above \uFFFF?
No. ANTLR's maximum is the same as Java's Character.MAX_VALUE, which is \uFFFF.
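To make the Character.MAX_VALUE limit concrete, here is a small Java sketch (the class and method names are made up for illustration; this is not ANTLR code). It shows that a single Java char stops at 0xFFFF and that the code points from the question are stored as two chars, a surrogate pair:

```java
public class SupplementaryDemo {
    // Renders a code point as the UTF-16 code units Java stores it in,
    // each unit written as a backslash, the letter u, and four hex digits.
    static String describe(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The largest value a single char can hold.
        System.out.println(Integer.toHexString(Character.MAX_VALUE));
        // The two ends of the range from the question.
        System.out.println(describe(0xE0100));
        System.out.println(describe(0xE01EF));
    }
}
```

Run as written, this should print ffff, then \uDB40\uDD00 and \uDB40\uDDEF: both ends of the range lie outside what one char, and therefore one four-digit escape, can express.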
If you look at (a part of) ANTLR4's lexer grammar you will see these rules:
// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
  : Esc
    ( [btnfr"'\\] // The standard escaped character set such as tab, newline, etc.
    | UnicodeEsc  // A Unicode escape sequence
    | .           // Invalid escape character
    | EOF         // Incomplete at EOF
    )
  ;
...
fragment UnicodeEsc
  : 'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
  ;
...
fragment Esc : '\\' ;
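The UnicodeEsc fragment accepts at most four hex digits after the u. Assuming the tool reads character sets the same way (an inference from the grammar, not something verified against ANTLR's source), the set from the question would be split differently from what was intended. A rough Java sketch of that four-digit rule, with hypothetical names throughout:

```java
import java.util.ArrayList;
import java.util.List;

public class FourDigitEscapeDemo {
    // Splits the body of a character set into elements. After a backslash
    // followed by the letter u, at most four hex digits are consumed,
    // mirroring the UnicodeEsc fragment. Everything else is one element.
    static List<String> split(String body) {
        List<String> elements = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            boolean isEscape = body.charAt(i) == '\\'
                    && i + 1 < body.length()
                    && body.charAt(i + 1) == 'u';
            if (isEscape) {
                int end = i + 2;
                // Stop after four hex digits, or earlier at a non-hex character.
                while (end < body.length() && end < i + 6
                        && Character.digit(body.charAt(end), 16) >= 0) {
                    end++;
                }
                elements.add(body.substring(i, end));
                i = end;
            } else {
                elements.add(String.valueOf(body.charAt(i)));
                i++;
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        // The set body from the question, without the square brackets.
        System.out.println(split("\\uE0100-\\uE01EF"));
    }
}
```

This prints [\uE010, 0, -, \uE01E, F]. Under that reading the set would contain U+E010, the range from '0' to U+E01E, and the letter F, which would explain why the rule matches characters outside the intended range.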