We're using JavaCC to parse source code that is encoded in UTF-8 (the text is Japanese). Our grammar contains a declaration like this:
< #LETTER:
[
"\u0024",
"\u0041"-"\u005a",
"\u005f",
"\u0061"-"\u007a",
"\u00c0"-"\u00d6",
"\u00d8"-"\u00f6",
"\u00f8"-"\u00ff",
"\u0100"-"\u1fff",
"\u3040"-"\u318f",
"\u3300"-"\u337f",
"\u3400"-"\u3d2d",
"\u4e00"-"\u9fff",
"\uf900"-"\ufaff"
]
>
When the input contains a string like "日建フェンス工業", parsing fails because of the character 業. If I remove that character, it works as expected. The code point of 業 is "\u696d", which, as the declaration shows, should fall within the range "\u4e00"-"\u9fff".
Any suggestions?
PS: If we rewrote this grammar using ANTLR, what would it look like?
Thank you so much.
There is nothing wrong with your token fragment and nothing wrong with JavaCC. The problem lies elsewhere.
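As a quick sanity check that does not involve JavaCC at all, the ranges can be tested directly in plain Java. This is only a sketch: the class name `LetterRangeCheck` and its helper methods are made up for illustration, and the ranges are copied by hand from the LETTER fragment in the question. The test string is written with Unicode escapes so the result does not depend on how the source file itself is encoded.

```java
final class LetterRangeCheck {
    // Inclusive [low, high] ranges copied from the LETTER fragment in the question.
    private static final int[][] LETTER_RANGES = {
        {0x0024, 0x0024}, {0x0041, 0x005a}, {0x005f, 0x005f}, {0x0061, 0x007a},
        {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x00ff}, {0x0100, 0x1fff},
        {0x3040, 0x318f}, {0x3300, 0x337f}, {0x3400, 0x3d2d}, {0x4e00, 0x9fff},
        {0xf900, 0xfaff}
    };

    // True if the character lies inside one of the ranges above.
    static boolean isLetter(char c) {
        for (int[] range : LETTER_RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Index of the first character outside all ranges, or -1 if there is none.
    static int firstNonLetter(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isLetter(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The string from the question, spelled out as escapes.
        String input = "\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d";
        System.out.println("U+696D is a LETTER: " + isLetter('\u696d'));
        System.out.println("first non-LETTER index: " + firstNonLetter(input));
    }
}
```

If every character of the string passes this check, the ranges themselves are not what rejects 業.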
Here is a JavaCC specification made by copying and pasting your problem code into JavaCC.
options {
  static = true;
  debug_token_manager = true;
}

PARSER_BEGIN(MyNewGrammar)
package funnyunicode;

import java.io.StringReader;

public class MyNewGrammar
{
  public static void main(String args[]) throws ParseException
  {
    // Feed the problem string directly, bypassing any file decoding.
    MyNewGrammar parser = new MyNewGrammar(new StringReader("日建フェンス工業"));
    MyNewGrammar.go();
    System.out.println("OK.");
  }
}
PARSER_END(MyNewGrammar)
TOKEN :
{
< WORD : (<LETTER>)+ >
|
< #LETTER:
[
"\u0024",
"\u0041"-"\u005a",
"\u005f",
"\u0061"-"\u007a",
"\u00c0"-"\u00d6",
"\u00d8"-"\u00f6",
"\u00f8"-"\u00ff",
"\u0100"-"\u1fff",
"\u3040"-"\u318f",
"\u3300"-"\u337f",
"\u3400"-"\u3d2d",
"\u4e00"-"\u9fff",
"\uf900"-"\ufaff"
] >
}
void go() :
{Token tk ; }
{
tk=<WORD> <EOF>
}
And here is the output from the resulting Java program:
Current character : \u65e5 (26085) at line 1 column 1
Starting NFA to match one of : { <WORD> }
Current character : \u65e5 (26085) at line 1 column 1
Currently matched the first 1 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u5efa (24314) at line 1 column 2
Currently matched the first 2 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30d5 (12501) at line 1 column 3
Currently matched the first 3 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30a7 (12455) at line 1 column 4
Currently matched the first 4 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30f3 (12531) at line 1 column 5
Currently matched the first 5 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30b9 (12473) at line 1 column 6
Currently matched the first 6 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u5de5 (24037) at line 1 column 7
Currently matched the first 7 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u696d (26989) at line 1 column 8
Currently matched the first 8 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
****** FOUND A <WORD> MATCH (\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d) ******
Returning the <EOF> token.
OK.
As you can see, the generated tokenizer has no trouble recognizing \u696d as a LETTER.
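Since the tokenizer accepts the string when it arrives through a StringReader, one place worth checking is how the real input is turned into characters before it reaches the parser. This is an assumption on my part, because the question does not show how the source file is read. The sketch below uses a hypothetical class `Utf8ReaderCheck` with made-up helper names. It decodes UTF-8 bytes through a Reader with an explicit charset, and for contrast decodes the same bytes with ISO-8859-1 to show how a wrong charset changes the characters the tokenizer would see.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8ReaderCheck {
    // Wraps a byte stream in a Reader that decodes explicitly as UTF-8
    // instead of relying on the platform default charset.
    static Reader utf8Reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Reads the whole Reader into a String and closes it.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string from the question, written as escapes so this file's own
        // encoding does not matter.
        String original = "\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset restores the original characters.
        String decoded = readAll(utf8Reader(new ByteArrayInputStream(utf8)));
        System.out.println("round trip intact: " + decoded.equals(original));

        // Decoding the same bytes with the wrong charset yields one char per byte.
        String misdecoded = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("chars after wrong decoding: " + misdecoded.length());
    }
}
```

In the test program above, a Reader built this way could be passed to the parser in place of the StringReader, since the generated parser there is constructed from a java.io.Reader.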