ASM documentation says a label represent a basic block, and it is a node in the control graph. So I test the visitLabel
method on this simple example:
public static void main(String[] args) {
int x = 3, y = 4;
if (x < y) {
x++;
}
}
For the visitLabel
method, I instrument it with a native API: setID(int id)
, where the id is incremental. In this example, a CFG should have 3 nodes: one at the beginning, and one for each branch of the if statement. So I expect setID
would be called in 3 locations. However, it is called 5 times, and there are a lot of nop
instructions. Could anybody explain for me why?
Here is the instrumented bytecode for the above program.
public static void main(java.lang.String[]);
Code:
0: iconst_2
1: invokestatic #13 // Method setId:(I)V
4: iconst_3
5: istore_1
6: iconst_3
7: invokestatic #13 // Method setId:(I)V
10: iconst_4
11: istore_2
12: iconst_4
13: invokestatic #13 // Method setId:(I)V
16: iload_1
17: iload_2
18: if_icmpge 28
21: iconst_5
22: invokestatic #13 // Method setId:(I)V
25: iinc 1, 1
28: bipush 6
30: invokestatic #13 // Method setId:(I)V
33: return
34: nop
35: nop
36: nop
37: nop
38: athrow
What I don't understand is why there is a label
before each istore
instruction. There is no branching to make it a new node in the CFG.
The primary purpose of a Label
is to denote a position in the bytecode sequence. Since this is needed for branch targets, you can use them to identify basic blocks. But you have to be aware, that they are also used for reporting line numbers, when a LineNumberTable
attribute is present and for reporting local variable scopes when a LocalVariableTable
attribute is present as well as, for newer class files, their type annotations recorded in a RuntimeVisibleTypeAnnotations
attribute. Further, labels may mark the protected area of an exception handler. For code generated from Java source code, this protected area matches the try
block, so its a basic block, but that doesn’t need to hold for other bytecode.
See
Since the scope of local variables may span the last return
instruction, it is possible to encounter labels after that last instruction, which is what happens in your case. You are injecting a bipush 7, invokestatic #13
after the return
instruction, resulting in unreachable code.
Apparently, you are also using the COMPUTE_FRAMES
options to let ASM recalculate stack map frames from scratch, but it is impossible to calculate frames for unreachable code, due to the unknown initial stack state. ASM solves this problem by replacing unreachable code with nop
instructions, followed by a single athrow
statement. For this sequence, it’s possible to specify a valid initial stack frame and it has no impact on the execution (as the code is unreachable).
As you can see, four nop
instructions plus one athrow
instruction span five bytes, which is the same size as the replaced bipush 7, invokestatic #13
sequence had.
You can get rid of most of these reported labels, by specifying ClassReader.SKIP_DEBUG
to its accept
method. Then, you get only one reported label for your example, the branch target associated with the if
statement. But you have to handle the visitJumpInsn
to identify the start of the conditional code.
So, to identify all basic blocks, you have to handle all branch instructions, i.e. via visitJumpInsn
, visitLookupSwitchInsn
, and visitTableSwitchInsn
, as well as all end instructions, i.e. athrow
and all variants of return
. Further, you need to process all visitTryCatchBlock
calls. If you need to identify potential targets of branch instructions in a single pass, I’d use visitFrame
instead of labels, as frames are mandatory at all branch targets for class file version of 51 (Java 7) or higher.
By the way, when all you’re injecting are these sequences of loading a constant and calling a static method (at reachable locations), I’d use COMPUTE_MAXS
instead of COMPUTE_FRAMES
, as an expensive recalculation is not necessary when the general code structure does not change.