Tags: java, c#, parsing, antlr4, antlr4cs

Java Grammar from GitHub for ANTLR4 and C# target


I have given up on porting the C# grammar from ANTLR 3.2 to ANTLR 4, so now I want to build a Java parser and visitor instead. The Java grammar for ANTLR 4 downloaded from GitHub (https://github.com/antlr/grammars-v4/blob/master/java/Java.g4) is meant to work with any target language, but some of its code is specific to the Java target and does not work with C#. I am talking about these lexer rules:

fragment
JavaLetter
:   [a-zA-Z$_] // these are the "java letters" below 0xFF
|   // covers all characters above 0xFF which are not a surrogate
    ~[\u0000-\u00FF\uD800-\uDBFF]
   // {Character.isJavaIdentifierStart(_input.LA(-1))}?
|   // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
    [\uD800-\uDBFF] [\uDC00-\uDFFF]
    //{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char) _input.LA  (-1)))}?
;

fragment
JavaLetterOrDigit
:   [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
|   // covers all characters above 0xFF which are not a surrogate
    ~[\u0000-\u00FF\uD800-\uDBFF]
   // {Character.isJavaIdentifierPart(_input.LA(-1))}?
|   // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
    [\uD800-\uDBFF] [\uDC00-\uDFFF]
    //{char.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;

I have commented out the target-specific code starting with {Character.isJavaIdentifier...} and the grammar now works. But why is it there? I think it returns true if the character before (or the two characters before, in the case of LA(-2)) is an identifier part, but what is the action code for? In C#, the Char type has no static method such as isJavaIdentifierPart.

My question is: if I remove the action code, will the parser fail on specific identifier names when parsing Java input? If so, how can I replace it for the C# target?

Thanks for replies! PK


Solution

  • In the Java Language Specification §3.8, an identifier is defined in terms of two static methods on the Character class.

    Identifier:
      IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
    
    IdentifierChars:
      JavaLetter {JavaLetterOrDigit}
    
    JavaLetter:
      any Unicode character that is a "Java letter"
    
    JavaLetterOrDigit:
      any Unicode character that is a "Java letter-or-digit"
    

    A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true.

    A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
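
    To make the two definitions above concrete, here is a minimal Java sketch that applies them code point by code point. The class name JavaIdentifierCheck and the helper isIdentifierChars are hypothetical names chosen for illustration; only the Character methods come from the JDK. The sketch covers IdentifierChars only and does not exclude keywords or literals.

    ```java
    public class JavaIdentifierCheck {

        /**
         * Returns true if s matches IdentifierChars from JLS 3.8:
         * one "Java letter" followed by zero or more "Java letter-or-digit".
         * Keywords, boolean literals and null are NOT excluded here.
         */
        public static boolean isIdentifierChars(String s) {
            if (s.isEmpty()) {
                return false;
            }
            // Read whole code points so surrogate pairs count as one character.
            int first = s.codePointAt(0);
            if (!Character.isJavaIdentifierStart(first)) {
                return false;
            }
            int i = Character.charCount(first);
            while (i < s.length()) {
                int cp = s.codePointAt(i);
                if (!Character.isJavaIdentifierPart(cp)) {
                    return false;
                }
                i += Character.charCount(cp);
            }
            return true;
        }

        public static void main(String[] args) {
            // U+1D400 lies outside the BMP and is encoded as a surrogate pair.
            String supplementary = new String(Character.toChars(0x1D400));

            System.out.println(isIdentifierChars("foo_1"));       // true
            System.out.println(isIdentifierChars("1foo"));        // false: starts with a digit
            System.out.println(isIdentifierChars("\u03C0r"));     // Greek letter pi, then 'r'
            System.out.println(isIdentifierChars(supplementary)); // letter above U+FFFF
        }
    }
    ```

    The last two cases are the ones the commented-out predicates in the question are responsible for: a character above the ASCII range, and a character that needs two UTF-16 code units.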

    The grammar implements this in a way designed to maximize performance for the expected inputs. In particular, the most common characters, those in the set [a-zA-Z0-9_$] (regular expression syntax), are handled directly by the grammar. The language specification guarantees that this set will always be considered identifier characters.

    ANTLR 4 does not cache DFA transitions for UTF-16 code units above U+007F, so anything outside the previously described set is on a "slow" path for the lexer anyway. Rather than bloat the size of the state machine, these characters are handled using a clean and simple semantic predicate.
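
    As a rough illustration of what the predicate contributes, the following Java sketch compares the character set alternative from the question's grammar with and without the predicate check. It is a simulation under stated assumptions, not ANTLR code: the class PredicateEffect and both methods are hypothetical, and the range test mirrors the set as written in the question (everything above U+00FF except high surrogates).

    ```java
    public class PredicateEffect {

        /**
         * Mirrors the set alternative on its own: any UTF-16 code unit
         * above U+00FF that is not a high surrogate (U+D800..U+DBFF).
         */
        public static boolean matchesSetOnly(char c) {
            return c > 0x00FF && !(c >= 0xD800 && c <= 0xDBFF);
        }

        /**
         * The same set, additionally gated by the check that the
         * JavaLetterOrDigit predicate performs in the Java target.
         */
        public static boolean matchesWithPredicate(char c) {
            return matchesSetOnly(c) && Character.isJavaIdentifierPart(c);
        }

        public static void main(String[] args) {
            char[] samples = {
                0x03C0, // Greek small letter pi
                0x2260, // not-equal sign
                0x2014  // em dash
            };
            for (char c : samples) {
                System.out.printf("U+%04X setOnly=%b withPredicate=%b%n",
                        (int) c, matchesSetOnly(c), matchesWithPredicate(c));
            }
        }
    }
    ```

    Without the predicate, the set accepts the not-equal sign and the dash as identifier characters, so a lexer built from the commented-out grammar would accept identifiers that a Java compiler rejects. It would not reject any valid identifier made of characters that the set covers.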

    If your source code does not use code points above U+007F for Unicode identifiers, then you can safely reduce the grammar to the following:

    fragment
    JavaLetter
      :   [a-zA-Z$_] // these are the "java letters" below 0xFF
      ;
    
    fragment
    JavaLetterOrDigit
      :   [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
      ;
    

    Otherwise, for complete support you can use the Java-LR.g4 grammar from the C# target (rename to Java.g4 before using).