Tags: java, unicode, io, nio, filereader

Java Nio ByteBuffer truncate unicode characters when buffer reaches its bound


I was writing a function in Java that reads a file and returns its contents as a String:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public static String ReadFromFile(String fileLocation) {
    StringBuilder result = new StringBuilder();
    RandomAccessFile randomAccessFile = null;
    FileChannel fileChannel = null;
    try {
        randomAccessFile = new RandomAccessFile(fileLocation, "r");
        fileChannel = randomAccessFile.getChannel();
        // Deliberately tiny buffer to make the problem easy to reproduce
        ByteBuffer byteBuffer = ByteBuffer.allocate(10);
        CharBuffer charBuffer = null;
        int bytesRead = fileChannel.read(byteBuffer);
        while (bytesRead != -1) {
            byteBuffer.flip();
            charBuffer = StandardCharsets.UTF_8.decode(byteBuffer);
            result.append(charBuffer.toString());
            byteBuffer.clear();
            bytesRead = fileChannel.read(byteBuffer);
        }
    } catch (IOException ignored) {
    } finally {
        try {
            if (fileChannel != null)
                fileChannel.close();
            if (randomAccessFile != null)
                randomAccessFile.close();
        } catch (IOException ignored) {
        }
    }
    return result.toString();
}

From the code above you can see that I set ByteBuffer.allocate to only 10 bytes on purpose, to make things clearer. Now I want to read a file named "test.txt" that contains Chinese Unicode characters, like this:

乐正绫我爱你乐正绫我爱你

Below is my test code for it:

System.out.println(ReadFromFile("test.txt"));

Expected Output in Console

乐正绫我爱你乐正绫我爱你

Actual Output in Console

乐正绫���爱你��正绫我爱你

Possible Reason
The ByteBuffer is only 10 bytes, so a multi-byte UTF-8 character that straddles a 10-byte boundary is split across two reads, and each incomplete half is decoded to the replacement character �.
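The failure can be reproduced in isolation: decoding only part of a multi-byte UTF-8 sequence yields the replacement character. This is a minimal sketch (the class and variable names are mine, not from the original code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SplitDecodeDemo {
    public static void main(String[] args) {
        // "乐" is three bytes in UTF-8: E4 B9 90
        byte[] bytes = "乐".getBytes(StandardCharsets.UTF_8);
        // Decode only the first two bytes, as if a buffer boundary cut the character.
        // Charset.decode() replaces malformed input with U+FFFD by default.
        String partial = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes, 0, 2)).toString();
        // The truncated sequence decodes to the replacement character, not "乐"
        System.out.println(partial.indexOf('\uFFFD') >= 0); // prints "true"
    }
}
```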

Attempt To Solve
Increasing the ByteBuffer allocation to 20 bytes, I got the result below:

乐正绫我爱你��正绫我爱你

Not A Robust Solution
Allocating a huge ByteBuffer, say 102400 bytes, only hides the problem, and it is not practical for very large text files.

Question
How to solve this problem?


Solution

  • You can't do it this way, since you don't know in advance how many bytes each character occupies in UTF-8, and you really don't want to rewrite that decoding logic yourself.

    There's Files.readString() in Java 11; for lower versions you can use Files.readAllBytes(), e.g.

    Path path = new File(fileLocation).toPath();
    String contents = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
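If the file really is too large to load in one call, a Reader streams it instead: the Reader's internal decoder carries partial byte sequences over between reads, so multi-byte characters are never split. A sketch, assuming a UTF-8 file (the class name, method name, and 8192-char buffer size are my choices):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingRead {
    // Reads a UTF-8 file chunk by chunk without splitting multi-byte
    // characters: decoding happens inside the Reader, not on raw bytes.
    public static String readWithReader(String fileLocation) throws IOException {
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(fileLocation), StandardCharsets.UTF_8)) {
            char[] chunk = new char[8192];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                result.append(chunk, 0, n);
            }
        }
        return result.toString();
    }
}
```

In practice you would process each chunk as it arrives rather than accumulate the whole file into a StringBuilder, but the decoding behavior is the point here.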