StreamDecoder vs InputStreamReader when reading malformed files


I came across some strange behavior when reading files in Java 8, and I'm wondering if someone can make sense of it.

Scenario:

Reading a malformed text file. By malformed I mean that it contains bytes that do not form a valid UTF-8 sequence, so they cannot be decoded to any Unicode code point.

The code I use to create such a file is as follows:

byte[] text = new byte[1];
char k = (char) -60;
text[0] = (byte) k;
FileUtils.writeByteArrayToFile(new File("/tmp/malformed.log"), text);

This code produces a file that contains exactly one byte, 0xC4, which is outside the ASCII range and is not a valid UTF-8 sequence on its own.
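
For readers without Apache Commons IO on the classpath, the same file can probably be produced with the JDK alone. This is only a sketch: the class name MalformedFileDemo and the use of a temporary file instead of /tmp/malformed.log are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MalformedFileDemo {

    // Writes the single byte 0xC4 to a fresh temporary file and returns its path.
    static Path writeMalformedFile() {
        try {
            Path file = Files.createTempFile("malformed", ".log");
            Files.write(file, new byte[] { (byte) 0xC4 });
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        char k = (char) -60;   // narrowing int -60 to char gives 0xFFC4
        byte b = (byte) k;     // narrowing again keeps the low byte: 0xC4 (-60)
        System.out.println(b == (byte) 0xC4);                 // prints true

        Path file = writeMalformedFile();
        System.out.println(Files.readAllBytes(file).length);  // prints 1
    }
}
```

The casts in main mirror the ones in the question and show why the file ends up holding 0xC4.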

Attempting to cat this file produces the following output:

Which is the UNICODE Replacement Character. This makes sense because UTF-8 needs 2 bytes in order to decode non-ascii characters, but we only have one. This is the behavior i expect from my Java code as well.
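
The same substitution can be reproduced in memory, without a file. The sketch below relies on the documented behaviour of the String(byte[], Charset) constructor, which always replaces malformed input with the charset's replacement string. The class name ReplacementDemo and the second byte 0x80 are illustrative choices, not part of the original code.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {

    // Lenient decode: malformed sequences become U+FFFD instead of raising an error.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Formats the first char of a string as a code point label.
    static String firstCodePoint(String s) {
        return String.format("U+%04X", (int) s.charAt(0));
    }

    public static void main(String[] args) {
        // A lone lead byte: the two-byte sequence is never completed.
        String lone = decodeLenient(new byte[] { (byte) 0xC4 });
        // The same lead byte followed by a valid continuation byte.
        String pair = decodeLenient(new byte[] { (byte) 0xC4, (byte) 0x80 });

        System.out.println(firstCodePoint(lone)); // prints U+FFFD
        System.out.println(firstCodePoint(pair)); // prints U+0100
    }
}
```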

Pasting some common code:

private void read(Reader reader) throws IOException {

    CharBuffer buffer = CharBuffer.allocate(8910);

    buffer.flip();

    // move existing data to the front of the buffer
    buffer.compact();

    // pull in as much data as we can from the reader
    int charsRead = reader.read(buffer);

    // flip so the data can be consumed
    buffer.flip();

    ByteBuffer encode = Charset.forName("UTF-8").encode(buffer);
    byte[] body = new byte[encode.remaining()];
    encode.get(body);

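    // decode with the same charset used for encoding, not the platform default
    System.out.println(new String(body, StandardCharsets.UTF_8));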
}

Here is my first approach using nio:

FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(Channels.newReader(inputStream.getChannel(), "UTF-8"));

This produces the following exception:

java.nio.charset.MalformedInputException: Input length = 1

    at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.Reader.read(Reader.java:100)

Which is not what I expected, but it also kind of makes sense: this is a corrupt, illegal file, and the exception is basically telling us that the decoder expected more bytes after the one it read.
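
The "Input length = 1" in the message appears to be the length of the byte sequence the decoder rejected. A minimal sketch of that, using a CharsetDecoder left at its defaults (which report malformed input rather than replace it); the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {

    // Returns the length of the first malformed sequence,
    // 0 if the input decodes cleanly, or -1 for any other coding error.
    static int malformedLength(byte[] bytes) {
        try {
            // A fresh decoder uses CodingErrorAction.REPORT for malformed input.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return 0;
        } catch (MalformedInputException e) {
            return e.getInputLength();
        } catch (CharacterCodingException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(malformedLength(new byte[] { (byte) 0xC4 }));             // prints 1
        System.out.println(malformedLength("abc".getBytes(StandardCharsets.UTF_8))); // prints 0
    }
}
```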

And my second one (using regular java.io):

FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(new InputStreamReader(inputStream, "UTF-8"));

This does not fail and produces the exact same output as cat did:

Which also makes sense.

So my questions are:

  1. What is the expected behavior from a Java Application in this scenario?
  2. Why is there a difference between using Channels.newReader (which returns a StreamDecoder) and simply using the regular InputStreamReader? Am I doing something wrong in how I read?

Any clarifications would be much appreciated.

Thanks :)


Solution

  • The difference in behaviour comes down to how each reader configures its CharsetDecoder. InputStreamReader obtains its decoder through StreamDecoder.forInputStreamReader(..), which ends up in a constructor that replaces malformed input instead of failing:

    StreamDecoder(InputStream in, Object lock, Charset cs) {
        this(in, lock,
        cs.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE));
    }
    

    whereas Channels.newReader(..) creates the decoder with its default settings (i.e. CodingErrorAction.REPORT instead of REPLACE, which results in an exception further up):

    public static Reader newReader(ReadableByteChannel ch,
                                   String csName) {
        checkNotNull(csName, "csName");
        return newReader(ch, Charset.forName(csName).newDecoder(), -1);
    }
    

    So the two work differently, and the documentation does not make the difference obvious. I suppose the newer API changed the behaviour because you'd rather get an exception than have your data silently corrupted.

    Be careful when dealing with character encodings!
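
If you want the same behaviour from both APIs, one option is to stop relying on the defaults and pass an explicitly configured CharsetDecoder to each. Both InputStreamReader and Channels.newReader have an overload that accepts one. The sketch below is only an illustration: the class name, the helper names and the in-memory ByteArrayInputStream (standing in for the file) are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class ExplicitDecoderPolicy {

    // Builds a UTF-8 decoder with the same action for both error kinds.
    static CharsetDecoder utf8Decoder(CodingErrorAction action) {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
    }

    // Drains a reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[1024];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    // NIO route: Channels.newReader with an explicit decoder.
    static String viaChannel(byte[] data, CodingErrorAction action) throws IOException {
        try (Reader r = Channels.newReader(
                Channels.newChannel(new ByteArrayInputStream(data)),
                utf8Decoder(action), -1)) {
            return readAll(r);
        }
    }

    // java.io route: InputStreamReader with an explicit decoder.
    static String viaStream(byte[] data, CodingErrorAction action) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), utf8Decoder(action))) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] malformed = { (byte) 0xC4 };

        // With REPLACE, both routes yield a single U+FFFD.
        System.out.println(viaChannel(malformed, CodingErrorAction.REPLACE).equals("\uFFFD"));
        System.out.println(viaStream(malformed, CodingErrorAction.REPLACE).equals("\uFFFD"));

        // With REPORT, both routes fail the same way.
        try {
            viaStream(malformed, CodingErrorAction.REPORT);
        } catch (MalformedInputException e) {
            System.out.println("stream route: " + e.getMessage());
        }
        try {
            viaChannel(malformed, CodingErrorAction.REPORT);
        } catch (MalformedInputException e) {
            System.out.println("channel route: " + e.getMessage());
        }
    }
}
```

Which action to pick depends on the application: REPLACE keeps reading at the cost of losing the original bytes, while REPORT surfaces the corruption to the caller.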