java unicode compiler-construction antlr javacc

JavaCC and Unicode issue: why is \u696d not accepted by JavaCC although it belongs to the range "\u4e00"-"\u9fff"?


We're trying to use JavaCC to parse source code that is encoded in UTF-8 (the text is in Japanese). In the JavaCC grammar, we have a declaration like this:

< #LETTER:
  [
   "\u0024",
   "\u0041"-"\u005a",
   "\u005f",
   "\u0061"-"\u007a",
   "\u00c0"-"\u00d6",
   "\u00d8"-"\u00f6",
   "\u00f8"-"\u00ff",
   "\u0100"-"\u1fff",
   "\u3040"-"\u318f",
   "\u3300"-"\u337f",
   "\u3400"-"\u3d2d",
   "\u4e00"-"\u9fff",
   "\uf900"-"\ufaff"
  ]
>

When the parser meets a string like "日建フェンス工業", it fails because of the 業 character. If I remove that character, parsing works as expected. The code of 業 is "\u696d", and as you can see in the declaration, it should fall within the range "\u4e00"-"\u9fff".

Any suggestions?

PS: If we rewrote this grammar using ANTLR, what would it look like?

Thank you so much


Solution

  • There is nothing wrong with your token fragment and nothing wrong with JavaCC. The problem lies elsewhere.

    Here is a JavaCC specification made by copying and pasting your token fragment into a minimal grammar.

    options {
      static = true;
      debug_token_manager = true;
    }
    
    PARSER_BEGIN(MyNewGrammar)
    package funnyunicode;
    import java.io.StringReader ;
    
    public class MyNewGrammar
    {
      public static void main(String[] args) throws ParseException
      {
        MyNewGrammar parser = new MyNewGrammar(new StringReader("日建フェンス工業"));
        MyNewGrammar.go();
        System.out.println("OK.");
      }
    }
    PARSER_END(MyNewGrammar)
    
    TOKEN :
    {
      < WORD : (<LETTER>)+ >
    |
      < #LETTER:
      [
       "\u0024",
       "\u0041"-"\u005a",
       "\u005f",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "\u3040"-"\u318f",
       "\u3300"-"\u337f",
       "\u3400"-"\u3d2d",
       "\u4e00"-"\u9fff",
       "\uf900"-"\ufaff"
      ] >
    }
    
    void go() :
    {Token tk ; }
    {
      tk=<WORD> <EOF>
    }
    

    And here is the output from the resulting Java program:

    Current character : \u65e5 (26085) at line 1 column 1
       Starting NFA to match one of : { <WORD> }
    Current character : \u65e5 (26085) at line 1 column 1
       Currently matched the first 1 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    Current character : \u5efa (24314) at line 1 column 2
       Currently matched the first 2 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    Current character : \u30d5 (12501) at line 1 column 3
       Currently matched the first 3 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    Current character : \u30a7 (12455) at line 1 column 4
       Currently matched the first 4 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    Current character : \u30f3 (12531) at line 1 column 5
       Currently matched the first 5 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    Current character : \u30b9 (12473) at line 1 column 6
       Currently matched the first 6 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    Current character : \u5de5 (24037) at line 1 column 7
       Currently matched the first 7 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    Current character : \u696d (26989) at line 1 column 8
       Currently matched the first 8 characters as a <WORD> token.
       Possible kinds of longer matches : { <WORD> }
    ****** FOUND A <WORD> MATCH (\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d) ******
    
    Returning the <EOF> token.
    
    OK.
    

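    As you can see, the generated tokenizer has no trouble seeing \u696d as a LETTER. The grammar is therefore fine, and whatever breaks in your setup happens outside this token definition.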
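
Since the grammar accepts the string, one place worth checking is how the input is decoded before it reaches the token manager. This is only a guess; nothing in the question confirms it is the cause. The sketch below is plain Java with no JavaCC involved. The class `LetterRangeCheck` and its methods are hypothetical names made up for this illustration. It copies the ranges from the LETTER fragment, decodes the same UTF-8 bytes with two different charsets, and reports the index of the first character that falls outside the ranges.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LetterRangeCheck {
    // Inclusive ranges copied from the LETTER fragment in the question.
    private static final int[][] RANGES = {
        {0x0024, 0x0024}, {0x0041, 0x005a}, {0x005f, 0x005f},
        {0x0061, 0x007a}, {0x00c0, 0x00d6}, {0x00d8, 0x00f6},
        {0x00f8, 0x00ff}, {0x0100, 0x1fff}, {0x3040, 0x318f},
        {0x3300, 0x337f}, {0x3400, 0x3d2d}, {0x4e00, 0x9fff},
        {0xf900, 0xfaff}
    };

    // True if the character code lies in one of the LETTER ranges.
    static boolean isLetter(int c) {
        for (int[] r : RANGES) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Decodes the bytes with the given charset and returns the index of
    // the first decoded char that is not a LETTER, or -1 if all are.
    static int firstNonLetter(byte[] bytes, Charset cs) {
        String text = new String(bytes, cs);
        for (int i = 0; i < text.length(); i++) {
            if (!isLetter(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The test string from the question, written with Unicode escapes
        // so this source file does not depend on its own encoding.
        String s = "\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8, every character is a LETTER: prints -1.
        System.out.println(firstNonLetter(utf8, StandardCharsets.UTF_8));
        // Decoded as ISO-8859-1, the second byte of the first character
        // becomes U+0097, which is in no range: prints 1.
        System.out.println(firstNonLetter(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

If the real input comes from a file or stream, it may help to hand the parser a `Reader` built with an explicit charset, in the same way the test above passes a `StringReader`, rather than relying on the platform default encoding.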