Tags: java, encoding, utf-8, iso-8859-1

Java: reading a file containing Japanese characters


I am trying to read a file which contains some Japanese characters.

RandomAccessFile file = new RandomAccessFile("japanese.txt", "r");
String line;
while ((line = file.readLine()) != null) {
    System.out.println(line);
}

It returns garbled characters instead of Japanese. But when I convert the encoding, it prints properly:

line = new String(line.getBytes("ISO-8859-1"), "UTF-8");

What does this mean? Is the text file in ISO-8859-1 encoding?

$ file -i japanese.txt returns the following:

japanese.txt: text/plain; charset=utf-8

If the file is already UTF-8, please explain why it explicitly requires converting from Latin-1 to UTF-8.


Solution

  • No, readLine is an obsolete method that predates charsets and encodings: it turns every byte into a char with the high byte set to zero. Worse, byte 0x85 becomes U+0085 (NEL, the EBCDIC-derived "next line" character), which some line-splitting APIs treat as a line separator; if that byte sat inside a UTF-8 multi-byte sequence, the actual line could be broken in two. Other failure scenarios are feasible as well.

    Best use Files. It has newBufferedReader(path, Charset), and the single-argument overload uses a fixed default charset of UTF-8.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    Path path = Paths.get("japanese.txt");
    try (BufferedReader file = Files.newBufferedReader(path)) {
        String line;
        while ((line = file.readLine()) != null) {
            System.out.println(line);
        }
    }


    Now you'll read correct Strings.

    A RandomAccessFile is basically meant for binary data.
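
To see why the Latin-1 round trip in the question appears to "fix" the text: readLine widens each raw UTF-8 byte into a char, and getBytes("ISO-8859-1") maps those chars one-to-one back to the original bytes, which can then be decoded as the UTF-8 they really are. A minimal, self-contained sketch (the file name is hypothetical, chosen for the demo):

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical demo file, written as UTF-8 (3 bytes per character here).
        Path path = Paths.get("demo-japanese.txt");
        Files.write(path, "日本語".getBytes(StandardCharsets.UTF_8));

        String garbled;
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            // readLine widens each raw byte to a char with the high byte zero,
            // so the nine UTF-8 bytes arrive as nine Latin-1 characters.
            garbled = raf.readLine();
        }

        // ISO-8859-1 maps chars 0-255 back to bytes 0-255 losslessly,
        // recovering the original UTF-8 bytes for a proper decode.
        String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);

        System.out.println(garbled.length() + " garbled chars -> " + repaired);
        Files.deleteIfExists(path);
    }
}
```

This also shows why the trick is fragile: it only works because ISO-8859-1 happens to be a lossless byte-to-char mapping; any other intermediate charset would corrupt the data.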
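
The two-argument newBufferedReader overload matters when the file is not UTF-8. A sketch under the assumption of a hypothetical Shift_JIS-encoded file (a common encoding for Japanese text on Windows):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ShiftJisDemo {
    public static void main(String[] args) throws IOException {
        Charset sjis = Charset.forName("Shift_JIS");
        // Hypothetical demo file written in Shift_JIS rather than UTF-8.
        Path path = Paths.get("demo-sjis.txt");
        Files.write(path, "日本語のテスト".getBytes(sjis));

        StringBuilder content = new StringBuilder();
        // Pass the charset explicitly; the one-argument overload would
        // wrongly decode this file as UTF-8 and produce mojibake.
        try (BufferedReader reader = Files.newBufferedReader(path, sjis)) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line);
            }
        }
        System.out.println(content);
        Files.deleteIfExists(path);
    }
}
```

The output of `file -i` in the question is only a heuristic guess; when the source of a file is known, passing its charset explicitly is safer than relying on detection or defaults.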