Given a String of length Integer.MAX_VALUE which contains characters that require more than one byte to represent, such as Chinese ideograms, what result would I get if I executed String.getBytes()? Is there any good way of testing for this type of error?
String is a sophisticated immutable class. Historically it just held a char[] array of two-byte UTF-16 chars, so String.getBytes(StandardCharsets.UTF_8) could indeed be expected to overflow the array index range.
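Nowadays (since Java 9), however, String holds a byte[] value instead. This allows compact storage: a string that contains only ISO-8859-1 characters uses one byte per char. The problem still exists, though. For instance, a compacted ISO-8859-1 String of almost Integer.MAX_VALUE chars can blow up when converted to UTF-8 (or even with String.toCharArray(), which needs two bytes per char), and the result is an OutOfMemoryError.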
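To make the expansion concrete, here is a minimal sketch comparing the encoded size of the same text in ISO-8859-1 and in UTF-8. The class name Latin1Expansion and its helper methods are made up for illustration; only String.getBytes(Charset) and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Expansion {
    /** Encoded size in ISO-8859-1: one byte per char for Latin-1 text. */
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    /** Encoded size in UTF-8: chars in the range 0x80..0xFF take two bytes. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u00e9\u00e9\u00e9"; // three times U+00E9 (e with acute)
        System.out.println(latin1Length(s)); // prints 3
        System.out.println(utf8Length(s));   // prints 6
    }
}
```

So a Latin-1 string made up of accented characters doubles in size as UTF-8, which is enough to exceed the array limit when the string is already longer than Integer.MAX_VALUE / 2.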
Hence several different overflows are possible. For UTF-16 chars converted with getBytes(StandardCharsets.UTF_8), a check could look like this:
private static final int MAX_INDEX = Integer.MAX_VALUE;

void checkUtf8Bytes(String s) {
    if (s.length() < MAX_INDEX / 6) {
        return; // Not hurt even by 6-byte UTF-8 sequences.
    }
    // Sum as long so that the total itself cannot overflow.
    if (s.codePoints().mapToLong(this::bytesNeeded).sum() > MAX_INDEX) {
        throw new IllegalArgumentException();
    }
}

private int bytesNeeded(int codePoint) {
    if (codePoint < 128) {
        return 1;
    } else if (codePoint ...) {
        ...
    }
}
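For reference, a self-contained variant of this check might look as follows. It is only a sketch: the class name Utf8SizeCheck and the method utf8Length are hypothetical, the thresholds in bytesNeeded are the standard UTF-8 ranges (1 to 4 bytes per code point), and the fast path uses MAX_INDEX / 3 on the assumption that a single Java char never needs more than 3 bytes in UTF-8 (a surrogate pair is 2 chars for 4 bytes).

```java
import java.nio.charset.StandardCharsets;

public class Utf8SizeCheck {
    // Assumption: the largest array the check should allow has Integer.MAX_VALUE elements.
    private static final long MAX_INDEX = Integer.MAX_VALUE;

    /** Bytes needed to encode one Unicode code point in UTF-8. */
    static int bytesNeeded(int codePoint) {
        if (codePoint < 0x80) {
            return 1; // ASCII
        } else if (codePoint < 0x800) {
            return 2; // includes the upper half of ISO-8859-1
        } else if (codePoint < 0x10000) {
            return 3; // rest of the Basic Multilingual Plane, e.g. Chinese ideograms
        } else {
            return 4; // supplementary planes (surrogate pairs in UTF-16)
        }
    }

    /**
     * UTF-8 size of the string, computed without encoding it.
     * For unpaired surrogates this is an upper bound, because the
     * encoder replaces them with a shorter substitute.
     */
    static long utf8Length(String s) {
        return s.codePoints().mapToLong(Utf8SizeCheck::bytesNeeded).sum();
    }

    /** Throws if the UTF-8 form of s would not fit into a byte[]. */
    static void checkUtf8Bytes(String s) {
        if (s.length() < MAX_INDEX / 3) {
            return; // at most 3 bytes per char, so the result must fit
        }
        if (utf8Length(s) > MAX_INDEX) {
            throw new IllegalArgumentException("UTF-8 form exceeds the maximum array size");
        }
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u4e2d\uD83D\uDE00"; // 1 + 2 + 3 + 4 bytes
        System.out.println(utf8Length(s));                                // prints 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // prints 10
    }
}
```

Note that the JVM's real array limit can be slightly below Integer.MAX_VALUE, so a string that passes this check may still fail to encode.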
I think it is easier to catch an OutOfMemoryError.
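Catching the error could be sketched like this. The class SafeEncode and the method tryUtf8 are hypothetical names, and the sketch assumes that the JVM reports an oversized result as java.lang.OutOfMemoryError; the exact behavior may differ between JDK versions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class SafeEncode {
    /**
     * Tries to encode s as UTF-8. Returns an empty Optional if the JVM
     * cannot allocate the result array (assumed to surface as OutOfMemoryError).
     */
    static Optional<byte[]> tryUtf8(String s) {
        try {
            return Optional.of(s.getBytes(StandardCharsets.UTF_8));
        } catch (OutOfMemoryError e) {
            // Nothing was allocated for the result; the caller can fall back
            // to encoding in chunks or streaming instead.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<byte[]> bytes = tryUtf8("abc");
        System.out.println(bytes.map(b -> b.length).orElse(-1)); // prints 3
    }
}
```

Catching OutOfMemoryError is generally a last resort, but it is reasonable here because the failing allocation is a single, well-defined request.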
Mind that a normal String storing UTF-16 chars in its byte[] can hold no more than Integer.MAX_VALUE / 2 chars.