I was writing a function in Java that reads a file and returns its content as a String:
public static String ReadFromFile(String fileLocation) {
    StringBuilder result = new StringBuilder();
    RandomAccessFile randomAccessFile = null;
    FileChannel fileChannel = null;
    try {
        randomAccessFile = new RandomAccessFile(fileLocation, "r");
        fileChannel = randomAccessFile.getChannel();
        ByteBuffer byteBuffer = ByteBuffer.allocate(10);
        CharBuffer charBuffer = null;
        int bytesRead = fileChannel.read(byteBuffer);
        while (bytesRead != -1) {
            byteBuffer.flip();
            charBuffer = StandardCharsets.UTF_8.decode(byteBuffer);
            result.append(charBuffer.toString());
            byteBuffer.clear();
            bytesRead = fileChannel.read(byteBuffer);
        }
    } catch (IOException ignored) {
    } finally {
        try {
            if (fileChannel != null)
                fileChannel.close();
            if (randomAccessFile != null)
                randomAccessFile.close();
        } catch (IOException ignored) {
        }
    }
    return result.toString();
}
As the code above shows, I deliberately set 'ByteBuffer.allocate' to only 10 bytes to make the problem easier to see. Now I want to read a file named "test.txt" that contains Unicode characters in Chinese, like this:
乐正绫我爱你乐正绫我爱你
Below is my test code for it:
System.out.println(ReadFromFile("test.txt"));
Expected Output in Console
乐正绫我爱你乐正绫我爱你
Actual Output in Console
乐正绫���爱你��正绫我爱你
Possible Reason
The ByteBuffer holds only 10 bytes, so a multi-byte UTF-8 character can be cut in half at a 10-byte boundary and each half is decoded on its own.
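The arithmetic behind this guess can be checked with a small sketch. It assumes the file holds exactly the 12 characters shown above, encoded as UTF-8; the class and method names (Utf8ChunkDemo, utf8Length, splitsCharacter) are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ChunkDemo {
    // Number of bytes the text occupies once encoded as UTF-8.
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if a chunk boundary placed before byte index `offset`
    // would fall inside a multi-byte character.
    public static boolean splitsCharacter(String text, int offset) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        return offset < bytes.length && (bytes[offset] & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        String text = "乐正绫我爱你乐正绫我爱你";
        // Each of these characters takes 3 bytes in UTF-8.
        System.out.println(utf8Length(text));
        // A boundary after 10 bytes lands inside the fourth character.
        System.out.println(splitsCharacter(text, 10));
        // A boundary after 9 bytes lands between two characters.
        System.out.println(splitsCharacter(text, 9));
    }
}
```

Since 10 is not a multiple of 3, every 10-byte read ends in the middle of a character unless it happens to align, which would explain why the replacement characters appear at the chunk boundaries.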
Attempt To Solve
Increasing the ByteBuffer allocation to 20 bytes gives the result below:
乐正绫我爱你��正绫我爱你
Not A Robust Solution
Allocate a very large ByteBuffer, e.g. 102400 bytes. That is not practical for very large text files.
Question
How to solve this problem?
You can't do this with a fixed-size buffer decoded chunk by chunk, since a UTF-8 character takes a variable number of bytes (one to four) and a read can end in the middle of one, and you really don't want to rewrite that logic yourself.
There's Files.readString() in Java 11; for lower versions you can use Files.readAllBytes(), e.g.
Path path = new File(fileLocation).toPath();
String contents = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
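Put together as a complete method, the suggestion above might look like the sketch below. The second method is an addition, not part of the answer as written: it assumes a small buffer is still wanted and lets a Reader from Files.newBufferedReader() do the decoding, because a Reader keeps the leftover bytes of an incomplete character until the next read. The class and method names (ReadFileExample, readWhole, readStreaming) are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadFileExample {
    // Whole-file approach: all bytes are decoded in one pass,
    // so no character can be split across a buffer boundary.
    public static String readWhole(String fileLocation) throws IOException {
        Path path = Paths.get(fileLocation);
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Streaming approach: the buffer holds decoded chars, not raw bytes,
    // and the Reader handles characters that span two underlying reads.
    public static String readStreaming(String fileLocation, int bufferChars) throws IOException {
        StringBuilder result = new StringBuilder();
        try (Reader reader = Files.newBufferedReader(Paths.get(fileLocation), StandardCharsets.UTF_8)) {
            char[] buffer = new char[bufferChars];
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                result.append(buffer, 0, charsRead);
            }
        }
        return result.toString();
    }
}
```

Both methods still build the full content in memory as a String, so for very large files it may be better to process each chunk inside the loop instead of appending it. Unlike the original function, these declare IOException rather than swallowing it, which makes a missing file visible to the caller.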