
How do I specify a Unicode literal that requires more than four hex digits in ANTLR?


I want to define a lexer rule for a range of Unicode characters whose code points need more than four hexadecimal digits. Concretely, I want to declare the following rule:

ID_Continue : [\uE0100-\uE01EF] ;

Unfortunately, it doesn't work: the rule matches characters that are outside this range. (I'm not sure exactly what behaviour it produces, but it isn't what I want.) I also tried padding with leading zeros to eight digits:

ID_Continue : [\U000E0100-\U000E01EF] ;

But it seems to result in the same unwanted behaviour.

I am using ANTLR 4 and its IntelliJ plugin for testing.

Does ANTLR 4 not support Unicode literals above \uFFFF?


Solution

  • No. ANTLR's maximum is the same as Java's Character.MAX_VALUE, that is, \uFFFF.

    If you look at (part of) ANTLR 4's own lexer grammar, you will see that a Unicode escape takes at most four hex digits after the u:

    // Any kind of escaped character that we can embed within ANTLR literal strings.
    fragment EscSeq
        :   Esc
            ( [btnfr"'\\]   // The standard escaped character set such as tab, newline, etc.
            | UnicodeEsc    // A Unicode escape sequence
            | .             // Invalid escape character
            | EOF           // Incomplete at EOF
            )
        ;
    
    ...
    
    fragment UnicodeEsc
        :   'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
        ;
    
    ...
    
    fragment Esc : '\\' ;
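
To make the consequence concrete, here is a small Java sketch (not part of ANTLR; the class and method names are made up for illustration). It assumes the tool reads escapes the way the UnicodeEsc fragment above suggests, stopping after four hex digits, and it shows that a code point such as U+E0100 does not fit in a single Java char.

```java
class UnicodeEscDemo {
    // Hypothetical helper mirroring the UnicodeEsc fragment:
    // after "\u", consume at most four hex digits.
    static int hexDigitsConsumed(String afterU) {
        int n = 0;
        while (n < 4 && n < afterU.length()
                && Character.digit(afterU.charAt(n), 16) >= 0) {
            n++;
        }
        return n;
    }

    // Renders the UTF-16 code units Java uses to store one code point.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Text following "\u" in the rule [\uE0100-\uE01EF]
        String afterU = "E0100-\\uE01EF";
        int n = hexDigitsConsumed(afterU);
        System.out.println("escape digits: " + afterU.substring(0, n));
        System.out.println("left over:     " + afterU.substring(n));

        // Character.MAX_VALUE is the largest value a single char can hold.
        System.out.println("char max:      "
                + String.format("\\u%04X", (int) Character.MAX_VALUE));

        // U+E0100 needs two chars (a surrogate pair) in a Java String.
        System.out.println("U+E0100 as UTF-16: " + utf16Units(0xE0100));
    }
}
```

Under that assumption, `\uE0100` is read as the character `\uE010` followed by a literal `0`, so the set would end up containing `\uE010`, the range from `0` to `\uE01E`, and the letter `F`. That would explain why the rule matches characters outside the intended range.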