Tags: java, jvm, bytecode, stack-frame

Why does the Java compiler generate weird local variables and stack map frames, and how can I use them to reliably determine variable types?


I'm creating a Java bytecode instrumentation tool with the help of the ASM framework, and I need to determine and possibly change the types of a method's local variables. Very quickly I encountered a simple case where the variables and stack map nodes look somewhat weird and don't give me enough information about the variables being used:

public static void test() {
    List l = new ArrayList();
    for (Object i : l) {
        int a = (int)i;
    }
}

This gives the following bytecode (as shown by IntelliJ IDEA):

public static test()V
   L0
    LINENUMBER 42 L0
    NEW java/util/ArrayList
    DUP
    INVOKESPECIAL java/util/ArrayList.<init> ()V
    ASTORE 0
   L1
    LINENUMBER 43 L1
    ALOAD 0
    INVOKEINTERFACE java/util/List.iterator ()Ljava/util/Iterator;
    ASTORE 1
   L2
   FRAME APPEND [java/util/List java/util/Iterator]
    ALOAD 1
    INVOKEINTERFACE java/util/Iterator.hasNext ()Z
    IFEQ L3
    ALOAD 1
    INVOKEINTERFACE java/util/Iterator.next ()Ljava/lang/Object;
    ASTORE 2
   L4
    LINENUMBER 44 L4
    ALOAD 2
    CHECKCAST java/lang/Integer
    INVOKEVIRTUAL java/lang/Integer.intValue ()I
    ISTORE 3
   L5
    LINENUMBER 45 L5
    GOTO L2
   L3
    LINENUMBER 46 L3
   FRAME CHOP 1
    RETURN
   L6
    LOCALVARIABLE i Ljava/lang/Object; L4 L5 2
    LOCALVARIABLE l Ljava/util/List; L1 L6 0
    MAXSTACK = 2
    MAXLOCALS = 4

As one can see, each of the 4 explicitly and implicitly defined variables takes 1 slot and 4 slots are reserved, but only 2 are declared, in a strange order (slot 2 before slot 0) and with a "hole" between them. The list iterator is later written into this "hole" with ASTORE 1 without the type of that variable being declared first. Only after this operation does a stack map frame appear, and it is unclear to me why only 2 variables are put into it, because more than 2 are used later. Then, with ISTORE 3, an int is written into a variable slot, again without any declaration.

At this point it looks like I need to ignore the variable definitions altogether and infer all types by interpreting the bytecode, simulating the JVM's operand stack and local variable slots.

I tried ASM's EXPAND_FRAMES option, but it is useless here: it only changes the type of the single frame node to F_NEW, and everything else looks exactly as before.

Can anybody explain why I see such strange code, and whether I have other options besides writing my own JVM interpreter?

Conclusion, based on all the answers (please correct me again if I'm wrong):

Variable definitions only exist to match source variable names and types to specific variable slots accessed at specific lines of code. They are apparently ignored by the JVM class verifier and during code execution. They can be absent, or may not match the actual bytecode.

Variable slots are treated like another stack, albeit one accessed by index in units of 32-bit words, and it is always possible to overwrite a slot's contents with a different temporary as long as matching types of load and store instructions are used.

Stack frame nodes contain the list of variables allocated from the beginning of the variable frame up to the last variable that is going to be loaded in the subsequent code without being stored first. This allocation map is expected to be the same regardless of which execution path was taken to reach its label. They also contain a similar map for the operand stack. Their contents may be specified as increments relative to the preceding stack frame node.

Variables that only exist within linear sequences of code will only appear in a stack frame node if there are variables with a longer lifetime allocated at a higher slot index.
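
As an illustration of the "increments relative to the preceding stack frame node" point, the two frames in the listing above can be replayed by hand. This is only a sketch: `FrameDeltaDemo`, `append` and `chop` are hypothetical names made up for this example and are not part of ASM. It assumes the implicit initial frame of a static method without parameters, which has no locals, and it ignores the operand stack (which is empty for both APPEND and CHOP frames).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not an ASM class): replays APPEND/CHOP frame deltas
// on a plain list of local variable types.
final class FrameDeltaDemo {

    // APPEND: the new frame has the previous locals plus the listed types.
    static List<String> append(List<String> previous, String... added) {
        List<String> locals = new ArrayList<>(previous);
        locals.addAll(Arrays.asList(added));
        return locals;
    }

    // CHOP n: the new frame has the previous locals minus the last n entries.
    static List<String> chop(List<String> previous, int count) {
        return new ArrayList<>(previous.subList(0, previous.size() - count));
    }

    public static void main(String[] args) {
        // Implicit initial frame of "static void test()": no locals at all.
        List<String> initial = Collections.emptyList();

        // L2: FRAME APPEND [java/util/List java/util/Iterator]
        List<String> atL2 = append(initial, "java/util/List", "java/util/Iterator");

        // L3: FRAME CHOP 1 (relative to the frame at L2)
        List<String> atL3 = chop(atL2, 1);

        System.out.println("L2 locals: " + atL2); // [java/util/List, java/util/Iterator]
        System.out.println("L3 locals: " + atL3); // [java/util/List]
    }
}
```

Slots 2 and 3 never show up in either frame, which is consistent with the conclusion above: they are only used inside the linear code between the two labels.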


Solution

  • The short answer is that you will indeed need to write some kind of interpreter if you want to know the types of stack frame elements at each code location, though most of this work has already been done. Even that isn’t sufficient to restore the source-level types of local variables, and there is no general solution for that at all.

    As said in other answers, attributes like LocalVariableTable are indeed intended to help restore the formal declarations of local variables, e.g. when debugging, but they only cover variables present in the source code (well, actually that’s the compiler’s decision) and are not mandatory. They are also not guaranteed to be correct, e.g. a bytecode transformation tool might have changed the code without updating these debugging attributes, but the JVM doesn’t care when you’re not debugging.

    As also said in other answers, the StackMapTable attribute is only meant to help bytecode verification, not to provide formal declarations. It describes the stack frame state at branch merge points, as far as is necessary for verification.

    So for linear code sequences without branches, the type of local variables and operand stack entries is only determined by inference, but these inferred types are not guaranteed to match the formally declared types at all.

    To illustrate the issue, the following branch-free code sequences produce identical bytecode:

    CharSequence cs;
    cs = "hello";
    cs = CharBuffer.allocate(20);
    
    {
        String s = "hello";
    }
    {
        CharBuffer cb = CharBuffer.allocate(20);
    }
    

    It’s the compiler’s decision to reuse the local variable’s slot for variables with disjoint scopes, but all relevant compilers do.
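
    To make the inference concrete, here is a minimal sketch of what an interpreter sees for the two snippets above. It uses a made-up two-instruction model rather than real bytecode: "PUSH type" stands in for whatever instruction leaves a value of that type on the operand stack, and "ASTORE slot" pops it into a local. The class name, the method name and the use of slot 0 are assumptions for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical mini interpreter over a simplified instruction model (not real bytecode).
final class LinearInferenceDemo {

    // Returns, in program order, every type that gets stored into the given slot.
    static List<String> storedTypes(String[][] code, int slot) {
        Deque<String> stack = new ArrayDeque<>();
        List<String> history = new ArrayList<>();
        for (String[] insn : code) {
            if (insn[0].equals("PUSH")) {
                stack.push(insn[1]);                 // value of this type is now on top
            } else if (insn[0].equals("ASTORE")) {
                String type = stack.pop();           // the slot takes the type of the popped value
                if (Integer.parseInt(insn[1]) == slot) {
                    history.add(type);
                }
            } else {
                throw new IllegalArgumentException("unknown instruction " + insn[0]);
            }
        }
        return history;
    }

    public static void main(String[] args) {
        String[][] code = {
            {"PUSH", "java/lang/String"},     // stands in for: LDC "hello"
            {"ASTORE", "0"},
            {"PUSH", "java/nio/CharBuffer"},  // stands in for: BIPUSH 20; INVOKESTATIC CharBuffer.allocate
            {"ASTORE", "0"},
        };
        System.out.println(storedTypes(code, 0)); // [java/lang/String, java/nio/CharBuffer]
    }
}
```

    The slot is inferred to hold a String and then a CharBuffer. CharSequence never appears, so nothing in the instruction stream distinguishes the single-variable version from the two-scopes version.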

    For verification, only correctness matters, so when a value of type X is stored into a local variable slot and then read to access the member Y.someMember, X must be assignable to Y, regardless of whether the local variable’s declared type actually is Z, a supertype of X but a subtype of Y.

    In the absence of debugging attributes, you could be tempted to analyze the subsequent use to guess the actual type (I suppose that is what most decompilers do), e.g. the following code

    CharSequence cs;
    cs = "hello";
    cs.charAt(0);
    cs = CharBuffer.allocate(20);
    cs.charAt(0);
    

    contains two invokeinterface CharSequence.charAt instructions, indicating that the variable’s actual type likely is CharSequence rather than String or CharBuffer, but the bytecode is still identical to, e.g.

    {
        String s = "hello";
        ((CharSequence)s).charAt(0);
    }
    {
        CharBuffer cb = CharBuffer.allocate(20);
        ((CharSequence)cb).charAt(0);
    }
    

    as these type casts only influence the subsequent method invocation but, being widening casts, do not generate any bytecode instructions of their own.

    So it’s not possible to precisely restore the declared types of source-level variables from the bytecode in a linear sequence, and stack map frame entries are not helpful either. Their purpose is to help verify the correctness of the subsequent code (which can be reached through different code paths), and for this, a frame doesn’t need to declare all existing elements. It only has to declare the elements that exist prior to the merge point and are actually used after it. But it depends on the compiler whether (and which of) the entries not actually needed by the verifier are present.
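
    The merge-point behaviour can be sketched with the loop from the question. The label L2 is reached once by falling through (locals: List, Iterator) and once via GOTO L2 from the loop body (locals: List, Iterator, Object, int). The sketch below is a deliberate simplification and not the verifier’s actual algorithm: `FrameMergeDemo` and `merge` are hypothetical names, and slots whose types differ are simply marked unusable ("TOP"), whereas a real implementation would look for a common supertype of reference types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified slot-wise merge of two incoming local variable states.
final class FrameMergeDemo {
    static final String TOP = "TOP"; // marker for "no usable value in this slot"

    static List<String> merge(List<String> a, List<String> b) {
        int size = Math.max(a.size(), b.size());
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            String x = i < a.size() ? a.get(i) : TOP; // slot missing on this path
            String y = i < b.size() ? b.get(i) : TOP;
            merged.add(x.equals(y) ? x : TOP);        // simplification: no supertype search
        }
        // Trailing unusable slots carry no information, so drop them.
        while (!merged.isEmpty() && merged.get(merged.size() - 1).equals(TOP)) {
            merged.remove(merged.size() - 1);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> fallThrough = Arrays.asList("java/util/List", "java/util/Iterator");
        List<String> backEdge = Arrays.asList(
                "java/util/List", "java/util/Iterator", "java/lang/Object", "I");
        System.out.println(merge(fallThrough, backEdge)); // [java/util/List, java/util/Iterator]
    }
}
```

    Under these assumptions the result matches the FRAME APPEND entry in the question: only the two slots that hold the same type on every path into L2 are listed, and the loop-body temporaries in slots 2 and 3 drop out.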