I came across some strange behavior with reading files in Java 8 and i'm wondering if someone can make sense of it.
Scenario:
Reading a malformed text file. By malformed i mean that it contains bytes that do not map to any unicode code points.
The code i use to create such a file is as follows:
byte[] text = new byte[1];
char k = (char) -60;
text[0] = (byte) k;
FileUtils.writeByteArrayToFile(new File("/tmp/malformed.log"), text);
This code produces a file that contains exactly one byte, which is not part of the ASCII table (nor the extended one).
Attempting to cat
this file produces the following output:
�
Which is the UNICODE Replacement Character. This makes sense because UTF-8 needs 2 bytes in order to decode non-ascii characters, but we only have one. This is the behavior i expect from my Java code as well.
Pasting some common code:
private void read(Reader reader) throws IOException {
CharBuffer buffer = CharBuffer.allocate(8910);
buffer.flip();
// move existing data to the front of the buffer
buffer.compact();
// pull in as much data as we can from the socket
int charsRead = reader.read(buffer);
// flip so the data can be consumed
buffer.flip();
ByteBuffer encode = Charset.forName("UTF-8").encode(buffer);
byte[] body = new byte[encode.remaining()];
encode.get(body);
System.out.println(new String(body));
}
Here is my first approach using nio
:
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(Channels.newReader(inputStream.getChannel(), "UTF-8");
This produces the following exception:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.Reader.read(Reader.java:100)
Which is not what i expected but also kind of makes sense, because this is actually a corrupt and an illegal file, and the exception is basically telling us it expected more bytes to be read.
And my second one (using regular java.io
):
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(new InputStreamReader(inputStream, "UTF-8"));
This does not fail and produces the exact same output as cat
did:
�
Which also makes sense.
So my questions are:
Channels.newReader
(which returns a StreamDecoder
) and simply using the regular InputStreamReader
? Am i doing something wrong with how i read? Any clarifications would be much appreciated.
Thanks :)
The difference between the behaviour actually goes right down to the StreamDecoder and Charset classes. The InputStreamReader
gets a CharsetDecoder
from StreamDecoder.forInputStreamReader(..)
which does replacement on error
StreamDecoder(InputStream in, Object lock, Charset cs) {
this(in, lock,
cs.newDecoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE));
}
while the Channels.newReader(..)
creates the decoder with the default settings (i.e. report instead of replace, which results in an exception further up)
public static Reader newReader(ReadableByteChannel ch,
String csName) {
checkNotNull(csName, "csName");
return newReader(ch, Charset.forName(csName).newDecoder(), -1);
}
So they work differently, but there's no indication in documentation anywhere about the difference. This is badly documented, but I suppose they changed the functionality because you'd rather get an exception than have your data silently corrupted.
Be careful when dealing with character encodings!