I am trying to decode UTF8 byte by byte with charset decoder. Is this possible?
The following code
public static void main(String[] args) {
Charset cs = Charset.forName("utf8");
CharsetDecoder decoder = cs.newDecoder();
CoderResult res;
byte[] source = new byte[] {(byte)0xc3, (byte)0xa6}; // LATIN SMALL LETTER AE in UTF8
byte[] b = new byte[1];
ByteBuffer bb = ByteBuffer.wrap(b);
char[] c = new char[1];
CharBuffer cb = CharBuffer.wrap(c);
decoder.reset();
b[0] = source[0];
bb.rewind();
cb.rewind();
res = decoder.decode(bb, cb, false);
System.out.println(res);
System.out.println(cb.remaining());
b[0] = source[1];
bb.rewind();
cb.rewind();
res = decoder.decode(bb, cb, false);
System.out.println(res);
System.out.println(cb.remaining());
}
gives the following output.
UNDERFLOW
1
MALFORMED[1]
1
Why?
My theory is that the problem with the way that you are doing it is that in the "underflow" condition, the decoder leaves the unconsumed bytes in the input buffer. At least, that is my reading.
Note this sentence in the javadoc:
"In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation. "
But you are clobbering the (presumably) unread byte.
You should be able to check whether my theory / interpretation is correct by looking at how many bytes remain unconsumed in bb
after the first decode(...)
call.
If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputing a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.