OutputStream translating bytes into characters with charset (opposite of OutputStreamWriter)


I need to manipulate content as it is being written to an OutputStream. Specifically, I need to replace CR or LF with CRLF to canonicalize text. This is easy for simple character sets where CR=13 and LF=10, but not so simple with multi-byte character sets. The characters should be replaced, not the bytes. It is non-trivial in general to do that in the output stream itself.
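To illustrate why the replacement has to happen on characters: in UTF-16BE, for instance, U+010A encodes as the bytes 0x01 0x0A, so a byte-level filter would mistake the second byte for LF. Once the content is available as characters, the canonicalization itself is simple. Below is a minimal sketch of such a character-level filter; the class name `CrLfWriter` is hypothetical, and it assumes that an existing CRLF pair should stay a single CRLF.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.Writer;

/** Hypothetical helper: forwards characters, rewriting every CR, LF, or CRLF as CRLF. */
class CrLfWriter extends FilterWriter {
    private boolean lastWasCr; // true if the previous character was CR

    CrLfWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        if (c == '\r') {
            out.write("\r\n");     // lone CR or start of CRLF: emit CRLF right away
        } else if (c == '\n') {
            if (!lastWasCr) {
                out.write("\r\n"); // bare LF
            }
            // An LF directly after CR is dropped: the CR already produced the CRLF.
        } else {
            out.write(c);
        }
        lastWasCr = (c == '\r');
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(cbuf[i]);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(str.charAt(i));
        }
    }
}
```

Because the filter only remembers whether the previous character was CR, it works across arbitrary chunk boundaries and never needs to buffer the payload.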

The built-in class OutputStreamWriter converts from characters to bytes for a configured encoding. I'm looking for a class that does the opposite, that is an OutputStream configured with a character set that buffers data as needed and translates the written bytes into characters with the character set (or throws on invalid byte sequences), making the characters available in some way, for example by forwarding the call to a Writer.

In other words I want to convert from bytes to characters on-the-fly as content is being written. I could write everything to a buffer and read it back with an InputStreamReader, but that is inefficient for very large payloads that won't fit in memory.

Is there a class like this somewhere (ideally open source, as I don't think it is built in)? If not, are there similar examples for efficient streaming conversion I could use as a starting point? The JDK classes I've seen are optimized for converting many bytes at a time, not for streaming use.


Solution

  • I wrote an implementation based on CharsetDecoder. Create a decoder and allocate a ByteBuffer and CharBuffer in the constructor:

    decoder = charset.newDecoder();
    byteBuf = ByteBuffer.allocate(bufferSize);
    charBuf = CharBuffer.allocate(bufferSize);
    

    Then implement write:

    public void write(int b) throws IOException {
        if (!byteBuf.hasRemaining()) {
            decodeAndWriteByteBuffer(false);
        }
        byteBuf.put((byte) b);
    }
    

    And decodeAndWriteByteBuffer:

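    private void decodeAndWriteByteBuffer(boolean endOfInput) throws IOException {
        byteBuf.flip(); // switch from filling to draining
        CoderResult cr;
        do {
            // Skip the decoder when there is nothing to decode, unless it must see end of input.
            cr = byteBuf.hasRemaining() || endOfInput
               ? decoder.decode(byteBuf, charBuf, endOfInput)
               : CoderResult.UNDERFLOW;
            if (cr.isUnderflow()) {
                if (endOfInput) {
                    // All input consumed: drain any state the decoder still holds.
                    do {
                        cr = decoder.flush(charBuf);
                        writeCharBuffer();
                    } while (cr.isOverflow());
    
                    if (cr.isError()) {
                        cr.throwException();
                    }
                }
            } else if (cr.isOverflow()) {
                // charBuf is full: hand the characters on and decode again.
                writeCharBuffer();
            } else {
                // Malformed or unmappable input.
                cr.throwException();
            }
        } while (cr.isOverflow());
        byteBuf.compact(); // keep any incomplete trailing byte sequence for the next call
    }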
    

    The remaining details are left as an exercise to the reader. It seems to work, though it is too early to say anything about performance.
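One possible way to fill in those remaining details is sketched below. It is not the answer author's actual class: the name `DecodingOutputStream`, the target `Writer` field, the bulk `write`, the `flush`/`close` behavior, `writeCharBuffer`, and the minimum buffer size of 16 are all assumptions added for illustration. Only `write(int)` and `decodeAndWriteByteBuffer` follow the snippets above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

/** Sketch: an OutputStream that decodes the written bytes and forwards the characters to a Writer. */
class DecodingOutputStream extends OutputStream {
    private final CharsetDecoder decoder;
    private final ByteBuffer byteBuf;
    private final CharBuffer charBuf;
    private final Writer out; // assumed sink for the decoded characters
    private boolean closed;

    DecodingOutputStream(Writer out, Charset charset, int bufferSize) {
        // Assumption: 16 comfortably holds one multi-byte sequence and one surrogate pair.
        if (bufferSize < 16) {
            throw new IllegalArgumentException("bufferSize must be at least 16");
        }
        this.out = out;
        this.decoder = charset.newDecoder(); // by default reports malformed/unmappable input
        this.byteBuf = ByteBuffer.allocate(bufferSize);
        this.charBuf = CharBuffer.allocate(bufferSize);
    }

    @Override
    public void write(int b) throws IOException {
        if (!byteBuf.hasRemaining()) {
            decodeAndWriteByteBuffer(false);
        }
        byteBuf.put((byte) b);
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        // Copy in chunks so large arrays do not go through write(int) byte by byte.
        while (len > 0) {
            if (!byteBuf.hasRemaining()) {
                decodeAndWriteByteBuffer(false);
            }
            int n = Math.min(len, byteBuf.remaining());
            byteBuf.put(src, off, n);
            off += n;
            len -= n;
        }
    }

    @Override
    public void flush() throws IOException {
        decodeAndWriteByteBuffer(false); // an incomplete trailing sequence stays buffered
        writeCharBuffer();
        out.flush();
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        try {
            decodeAndWriteByteBuffer(true); // throws if the input ends mid-character
        } finally {
            out.close();
        }
    }

    private void writeCharBuffer() throws IOException {
        charBuf.flip();
        out.write(charBuf.array(), charBuf.arrayOffset() + charBuf.position(), charBuf.remaining());
        charBuf.clear();
    }

    private void decodeAndWriteByteBuffer(boolean endOfInput) throws IOException {
        byteBuf.flip();
        CoderResult cr;
        do {
            cr = byteBuf.hasRemaining() || endOfInput
               ? decoder.decode(byteBuf, charBuf, endOfInput)
               : CoderResult.UNDERFLOW;
            if (cr.isUnderflow()) {
                if (endOfInput) {
                    do {
                        cr = decoder.flush(charBuf);
                        writeCharBuffer();
                    } while (cr.isOverflow());
                }
            } else if (cr.isOverflow()) {
                writeCharBuffer();
            } else {
                cr.throwException();
            }
        } while (cr.isOverflow());
        byteBuf.compact();
    }
}
```

A caller would wrap the target `Writer`, for example `new DecodingOutputStream(writer, StandardCharsets.UTF_8, 8192)`, write bytes to it as usual, and close it when done. Closing matters in this sketch: that is the point where the decoder sees the end of input, so a truncated multi-byte sequence is only reported then.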