Non-English Tokens in JavaCC


I already tried the answer to "Print in JavaCC", but for some reason it didn't work for me. I copied and pasted the text to a file and ran it, but when I input µ, for example, it didn't print anything.

I want to be able to use non-English characters in my string token. For testing purposes, I currently have:

options 
{
    UNICODE_INPUT = true;
    JAVA_UNICODE_ESCAPE = false;
}

PARSER_BEGIN(Unicode)

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

public class Unicode
{
    public static void main(String[] args)
    {
        if(args.length == 0)
        {
            System.out.println("File name not specified!");
            return;
        }

        System.out.println("-----Start-----\n\n");
        try
        {
            FileInputStream fis = new FileInputStream(args[0]);
            InputStreamReader isr = new InputStreamReader(fis, "UTF8");

            Unicode parser = new Unicode(isr);
            parser.start();
        }
        catch(FileNotFoundException ex){
            System.out.println(ex);
        }
        catch(UnsupportedEncodingException ex){
            System.out.println(ex);
        }
        catch(ParseException ex){
            System.out.println(ex);
        }
        catch(TokenMgrError ex){
            System.out.println(ex);
        }
        System.out.println("\n\n------End-------");
    }
}

PARSER_END(Unicode)

TOKEN:{
    //         á          é          í          ó          ú
    <STR: ("\u00e1" | "\u00e9" | "\u00ed" | "\u00f3" | "\u00fa")>
}

void start():
{
    Token found;
}
{
    (
        found = <STR>
        {System.out.println("Input: " + found.image);}
    )+

    <EOF>
}

When I run the parser and feed it a file containing á, é, í, ó, ú, all I get is a bunch of question marks.

Input: ?
Input: ?
Input: ?
Input: ?
Input: ?

I've read something about having to modify the char stream files that are automatically generated, but I don't really understand that.


Solution

  • This is a mismatch between the default encoding used by a Java PrintStream and the command-shell settings that affect standard output.

    Because the InputStreamReader encoding is specified explicitly and the input is apparently parsed correctly, the problem is not related to JavaCC. It should therefore be reproducible with this alone:

      System.out.println("\u00e1\u00e9\u00ed\u00f3\u00fa");
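
One way question marks like those in the question can arise: when the charset in use has no mapping for a character, Java substitutes that charset's replacement byte, which is `?` for US-ASCII. The sketch below illustrates the effect without JavaCC. The class name `EncodingLoss` and the choice of US-ASCII are assumptions made for illustration, not something taken from the question.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class EncodingLoss {
    // Encodes text with the named charset, then decodes it with the same charset.
    // Characters the charset cannot represent come back as the replacement '?'.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String accented = "\u00e1\u00e9\u00ed\u00f3\u00fa";
        // US-ASCII has no accented letters, so each one is replaced.
        System.out.println(roundTrip(accented, "US-ASCII"));
        // UTF-8 can represent every character, so the text survives unchanged.
        System.out.println(roundTrip(accented, "UTF-8").equals(accented));
    }
}
```

If the second line prints `true` while the console still shows wrong characters, the loss is happening at the output stage or in the shell, not in the parsed input.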
    

    The encoding used by the System.out PrintStream is taken from system property "file.encoding", which on my Windows system defaults to "Cp1252" (i.e. Windows-1252). It can be forced to use something different by setting "file.encoding", e.g.

      java -Dfile.encoding=UTF-8 Unicode
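
To check which encoding the JVM actually picked up, with and without the -D flag, a small diagnostic along these lines may help. The class name `EncodingInfo` is hypothetical, and the reported values depend on the JDK version and platform, so no expected output is shown.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, for illustration only.
class EncodingInfo {
    // Builds a one-line summary of the encoding settings visible to the JVM.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once plainly and once with -Dfile.encoding=UTF-8 shows whether the flag took effect on a given setup.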
    

    Also the standard PrintStream could be replaced by one that uses a different encoding:

      System.setOut(new PrintStream(System.out, true, "UTF-8"));
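
A slightly fuller sketch of the same idea follows, assuming the replacement is done at the very start of main, before the parser prints anything. The class name `Utf8Out` and the helper `encode` are hypothetical. The helper exists only to show which bytes a PrintStream emits for a given encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, for illustration only.
class Utf8Out {
    // Writes text through a PrintStream that uses the named encoding
    // and returns the raw bytes that reached the underlying stream.
    static byte[] encode(String text, String encoding) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, ex);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace standard output first, so everything printed later is UTF-8.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.out.println("\u00e1\u00e9\u00ed\u00f3\u00fa");

        // The same character occupies a different number of bytes per encoding.
        System.out.println(encode("\u00e1", "UTF-8").length);      // 2 bytes: C3 A1
        System.out.println(encode("\u00e1", "ISO-8859-1").length); // 1 byte: E1
    }
}
```

The bytes are only half of the picture: the console must interpret them with the same encoding, or the characters will still display incorrectly.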
    

    Either of the above forces the output to be generated in the specified encoding. However, when displaying the results on a console, it is important to know which encoding the shell uses. My Windows console defaults to Cp850, and the encoding can be changed with the chcp command. The println above produces the correct characters when Java uses "Windows-1252" and the console is set with chcp 1252.