I have been reading Java bytecode from a variety of files to help with my understanding of the .class files for a project where I need to integrate with a 3rd party library which has no source code and poor documentation available.
For my own amusement I ran the Apache BCEL library through my maven repository to see where the rarer class and method attributes such as type annotations are used and why.
I stumbled across a problem with a specific jar in which one of the constant pool entries - a CONSTANT_Utf8_info, specifically - would not decode. The library is icu4j-2.6.1.jar (com.ibm.icu:icu4j), and the file in question is LocaleElements_zh__PINYIN.class. Apache BCEL fails, and my own attempt at a quick bytecode reader complying with JVMS versions 8 and 9 stumbles into the same problem: it misreads this constant and then reads the next byte, which evaluates as an invalid constant tag (0x3C/60).
A quick check confirms that the class cannot be used from an IDE either (cannot resolve symbol). Investigating the actual bytecode in a hex editor shows that the constant at that offset (0x1AC) is a Utf8 constant (tag=0x01) with a declared length of 0x480E. Moving forward that many bytes in the file does indeed land on a 0x3C byte. Inspecting the file visually, I can see that the constant in question actually ends at offset 0x149BD, which makes the real length of the string 0x1480E (which is, incidentally, the value of the first three bytes at offset 0x1AC). This is of course not possible per the JVM class file specification, which caps a Utf8 constant at 0xFFFF (65535) bytes. The class file is quite old - version 46, i.e. Java 1.2.
I've pored over the specification and tried several possible implementations (both more and less strict) to parse this constant, but each one either fails on this entry or breaks the parsing of other, valid Utf8 constants.
My question, then, is: have I missed something, or is this a compiler mistake? If the latter, how could it have happened in the first place, given that compilers tend to be checked fairly thoroughly? Lastly, how does the Java compiler normally handle string literals that are longer than 65535 bytes when encoded?
Since you stated that “the classfile is quite old - version 46 or Java 1.2”, it is indeed possible that the classfile is simply broken due to the compiler of that time not rejecting the code when exceeding the limits.
See JDK-4309152: Compiler silently generates bytecode that exceeds VM limits:
The compiler does not properly enforce certain limits on the number or size of various classfile components. This results in code that appears to compile successfully, but fails at runtime during verification.
These were originally reported as separate bugs, which have now been closed as duplicates of this one. The original bug numbers are included with each item below.
…
- There is a 64k limit on UTF-8 encoded strings. (4071592)
This bug is reported to be fixed for 1.3.1_10, so it fits into the time frame.
Note that the referenced bug #4071592 refers to throwing a UTFDataFormatException when trying to write overly large strings in 1.2.0 and earlier, but #4303354 reports that invalid strings are silently generated in 1.3.0. So if the problematic class file has been generated by javac, it must have been a version between 1.3.0 and 1.3.1_10, compiled with -target 1.2.
Since the fix, the compiler's standard behavior is to report a compile-time error when a construct exceeds class file/JVM limits, rather than silently emitting an invalid class file.
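The same 64k limit can be reproduced with today's standard library: DataOutputStream.writeUTF uses the same u2-prefixed modified-UTF-8 format as CONSTANT_Utf8_info and throws the UTFDataFormatException mentioned in bug #4071592 when the encoded form exceeds 65535 bytes. A modern javac likewise rejects such a literal at compile time (with an error along the lines of "constant string too long"). A small demonstration:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

public class Utf8LimitDemo {
    public static void main(String[] args) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());

        // 65535 ASCII chars encode to exactly 0xFFFF bytes: still legal.
        char[] ok = new char[65535];
        Arrays.fill(ok, 'a');
        out.writeUTF(new String(ok));

        // One more character pushes the encoding past the u2 length field,
        // and writeUTF refuses rather than truncating the length.
        char[] tooLong = new char[65536];
        Arrays.fill(tooLong, 'a');
        try {
            out.writeUTF(new String(tooLong));
        } catch (UTFDataFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is exactly the check the old compiler was missing: instead of rejecting the oversized constant, it wrote the truncated length into the class file.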