java runtime-error

What happens when I try to get the bytes for a String but the conversion from Character to Byte overflows the Integer length?


Given a String of length Integer.MAX_VALUE which contains characters that require more than one byte to represent, such as Chinese ideograms, what result would I get if I executed String.getBytes()? Is there any good way of testing for this type of error?


Solution

  • String is a sophisticated immutable class. Historically it simply held a char[] array of two-byte UTF-16 chars, so for a long enough string String.getBytes(StandardCharsets.UTF_8) could indeed be expected to exceed the index range of a byte[].

    Since Java 9, however, String holds a byte[] value, which lets it store text compactly in another charset (ISO-8859-1 when every char fits in one byte). The problem still exists: for instance, a compacted ISO-8859-1 String with a length close to Integer.MAX_VALUE can explode when encoded as UTF-8 (and even with String.toCharArray()). The result is an OutOfMemoryError.
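    To make the size growth concrete, here is a minimal sketch that compares encoded sizes for a few short strings. The class name EncodedSizeDemo and the helper encodedLength are illustrative names, not part of any library.

    ```java
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodedSizeDemo {

        // Hypothetical helper: number of bytes the string occupies in the given charset.
        static int encodedLength(String s, Charset charset) {
            return s.getBytes(charset).length;
        }

        public static void main(String[] args) {
            String latin = "\u00e9";       // e with acute accent, fits in ISO-8859-1
            String cjk = "\u4e2d";         // a Chinese ideogram in the BMP
            String emoji = "\uD83D\uDE00"; // one code point stored as a surrogate pair

            System.out.println(encodedLength(latin, StandardCharsets.ISO_8859_1)); // 1
            System.out.println(encodedLength(latin, StandardCharsets.UTF_8));      // 2
            System.out.println(encodedLength(cjk, StandardCharsets.UTF_8));        // 3
            System.out.println(emoji.length());                                    // 2 chars
            System.out.println(encodedLength(emoji, StandardCharsets.UTF_8));      // 4
        }
    }
    ```

    So a one-byte-per-char compacted string can double in size, and a string of BMP ideograms can need three bytes per char.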

    Hence several different overflows are possible. For UTF-16 chars passed to getBytes(StandardCharsets.UTF_8), a check could look like this:

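    private static final int MAX_INDEX = Integer.MAX_VALUE;

    void checkUtf8Bytes(String s) {
        // UTF-8 needs at most 3 bytes per UTF-16 char (a surrogate pair is
        // 2 chars and 4 bytes), so a string this short cannot overflow.
        if (s.length() < MAX_INDEX / 3) {
            return;
        }
        // Sum as long so that the total itself cannot overflow.
        if (s.codePoints().mapToLong(this::bytesNeeded).sum() > MAX_INDEX) {
            throw new IllegalArgumentException("UTF-8 form exceeds the maximum array length");
        }
    }

    // Number of UTF-8 bytes needed to encode one code point.
    private int bytesNeeded(int codePoint) {
        if (codePoint < 0x80) {
            return 1;
        } else if (codePoint < 0x800) {
            return 2;
        } else if (codePoint < 0x10000) {
            return 3;
        }
        return 4;
    }
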

    I think it is easier to catch an OutOfMemoryError.

    Mind that a regular (non-compacted) String, which stores UTF-16 chars in its byte[], can hold no more than Integer.MAX_VALUE / 2 chars.
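
    The catch-the-error approach could be sketched as follows. This is only an illustration under stated assumptions: SafeEncode and tryUtf8 are made-up names, and it assumes the caller is happy to treat any OutOfMemoryError raised during encoding as "input too large", even though the error may also come from ordinary heap exhaustion.

    ```java
    import java.nio.charset.StandardCharsets;
    import java.util.Optional;

    public class SafeEncode {

        // Hypothetical helper: returns the UTF-8 bytes, or an empty Optional
        // if the JVM could not allocate the result array.
        static Optional<byte[]> tryUtf8(String s) {
            try {
                return Optional.of(s.getBytes(StandardCharsets.UTF_8));
            } catch (OutOfMemoryError e) {
                // Assumption: the failed allocation was the encoded array, so
                // it is safe to carry on and report the failure to the caller.
                return Optional.empty();
            }
        }

        public static void main(String[] args) {
            Optional<byte[]> bytes = tryUtf8("\u4e2d\u6587"); // two BMP ideograms
            System.out.println(bytes.map(b -> b.length + " bytes").orElse("too large"));
        }
    }
    ```

    Testing the failure path for real needs a heap of several gigabytes, so in unit tests it is usually more practical to exercise a size check on small inputs.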