I have given up porting the C# grammar from ANTLR 3.2 to ANTLR 4, so now I want to build a Java parser and visitor instead. The Java grammar for ANTLR 4 on GitHub (https://github.com/antlr/grammars-v4/blob/master/java/Java.g4) is meant to work with any target language, but some of its code is specific to the Java target and does not work with C#. I am talking about these lexer rules:
fragment
JavaLetter
    : [a-zA-Z$_] // these are the "java letters" below 0xFF
    | // covers all characters above 0xFF which are not a surrogate
      ~[\u0000-\u00FF\uD800-\uDBFF]
      // {Character.isJavaIdentifierStart(_input.LA(-1))}?
    | // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
      [\uD800-\uDBFF] [\uDC00-\uDFFF]
      //{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char) _input.LA (-1)))}?
    ;
fragment
JavaLetterOrDigit
    : [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
    | // covers all characters above 0xFF which are not a surrogate
      ~[\u0000-\u00FF\uD800-\uDBFF]
      // {Character.isJavaIdentifierPart(_input.LA(-1))}?
    | // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
      [\uD800-\uDBFF] [\uDC00-\uDFFF]
      //{char.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
    ;
I have commented out the target-specific code starting with {Character.isJavaIdentifier...} and it works now. But why is it there? I think it returns true if the character before, or two characters before in the case of LA(-2), is an identifier part, but what is the action code for? In C#, the Char type has no static method like isJavaIdentifierPart or anything similar.
My question is: if I remove the action code, will the parser fail on specific identifier names when parsing Java input? If yes, how can I replace it for the C# target?
Thanks for replies! PK
In the Java Language Specification §3.8, an identifier is defined in terms of two static methods on the Character class.
Identifier:
    IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
    JavaLetter {JavaLetterOrDigit}
JavaLetter:
    any Unicode character that is a "Java letter"
JavaLetterOrDigit:
    any Unicode character that is a "Java letter-or-digit"
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
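As a rough illustration of that definition, the following Java sketch checks a string against the IdentifierChars production using the two methods. The class and method names (IdentifierChars, isIdentifierChars) are made up for this example, and the exclusion of keywords, boolean literals and null from the first production is deliberately left out.

```java
// Hypothetical helper for illustration; not part of the grammar or of ANTLR.
final class IdentifierChars {
    // Returns true if s has the shape JavaLetter {JavaLetterOrDigit} (JLS 3.8).
    // Keywords, boolean literals and the null literal are NOT excluded here.
    static boolean isIdentifierChars(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Step by code point so a surrogate pair is tested as one character.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] candidates = {"foo_1", "$tmp", "1abc", "a-b"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isIdentifierChars(candidate));
        }
    }
}
```

The int overloads are used on purpose: they take a full code point, which is what the surrogate-pair alternative in the grammar reconstructs with Character.toCodePoint.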
The grammar implements this in a specific manner designed to maximize performance for the expected inputs. In particular, the most well-known characters, the set [a-zA-Z0-9_$] (regular expression syntax), are handled directly by the grammar. The language specification guarantees that this set will always be considered identifier characters.
ANTLR 4 does not cache DFA transitions for UTF-16 code units above U+007F, so anything outside the previously described set is on a "slow" path for the lexer anyway. Rather than bloat the size of the state machine, these characters are handled using a clean and simple semantic predicate.
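To show what the predicate contributes, here is a rough Java sketch of the second and third alternatives of JavaLetter written as plain methods, once without the check and once with it. In a lexer predicate, _input.LA(-1) is the code unit that was just consumed, so the check applies to the character the set has already matched. The class and method names are invented for illustration, and this is not how ANTLR evaluates the rule internally.

```java
// Hypothetical model of the JavaLetter alternatives; names are illustrative only.
final class JavaLetterSketch {
    // The negated set from the grammar: any code unit above U+00FF
    // that is not a high surrogate (U+D800 to U+DBFF).
    static boolean inNegatedSet(char c) {
        return c > 0x00FF && !(c >= 0xD800 && c <= 0xDBFF);
    }

    // Second alternative with the predicate commented out.
    static boolean withoutPredicate(char c) {
        return inNegatedSet(c);
    }

    // Second alternative with the predicate: the matched code unit
    // must also be a valid identifier start.
    static boolean withPredicate(char c) {
        return inNegatedSet(c) && Character.isJavaIdentifierStart(c);
    }

    // Third alternative with the predicate: a surrogate pair is combined
    // into one code point before it is tested.
    static boolean pairWithPredicate(char high, char low) {
        return Character.isHighSurrogate(high)
                && Character.isLowSurrogate(low)
                && Character.isJavaIdentifierStart(Character.toCodePoint(high, low));
    }

    public static void main(String[] args) {
        char notEqualSign = '\u2260'; // a math symbol, not a letter
        System.out.println(withoutPredicate(notEqualSign));
        System.out.println(withPredicate(notEqualSign));
    }
}
```

Under this model, dropping the predicate never rejects anything the original rule accepted; it only makes the rule more permissive. A character such as U+2260 would then be lexed as part of an identifier even though Character.isJavaIdentifierStart rejects it.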
If your source code does not use code points above U+007F for Unicode identifiers, then you can safely reduce the grammar to the following:
fragment
JavaLetter
    : [a-zA-Z$_] // these are the "java letters" below 0xFF
    ;
fragment
JavaLetterOrDigit
    : [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
    ;
Otherwise, for complete support you can use the Java-LR.g4 grammar from the C# target (rename to Java.g4 before using).
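If you are unsure which case applies to your input, a quick scan can tell you. The sketch below is a conservative check, assuming the source text is already loaded into a string: it reports the first UTF-16 code unit above U+007F anywhere in the text, including comments and string literals, where such characters would not affect identifiers at all. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of the grammar or of ANTLR.
final class AsciiOnlyCheck {
    // Returns the index of the first UTF-16 code unit above U+007F,
    // or -1 if the whole text is plain ASCII.
    static int firstNonAscii(CharSequence source) {
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) > 0x007F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String source = "int count = 1;";
        int index = firstNonAscii(source);
        System.out.println(index < 0
                ? "ASCII only: the reduced rules are enough"
                : "non-ASCII character at index " + index);
    }
}
```

A result of -1 means the reduced rules are safe for that file. Any other result only means the file needs a closer look, since the character may sit in a comment or string literal rather than in an identifier.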