I am using bytecode analysis (with BCEL) to get all imported classes of a class file. When I read the constant pool, not all imported classes appear as CONSTANT_Class entries (see spec); some appear only as CONSTANT_Utf8. My question: can I not rely solely on the CONSTANT_Class entries in the constant pool to find the imported classes? Do I really have to look at every entry and guess whether it is a class name? That does not seem to be correct in every situation either. Or do I have to read through the whole bytecode? Regards
No, it is not correct to use CONSTANT_Class_info entries alone to discover dependencies on other classes/interfaces. If you're parsing input files you trust or can tolerate incorrect information, you can get away with parsing the constant pool only, except for one corner case. To get precise information on arbitrary input you need to parse the whole class file.
(I assume by "dependencies" you mean those classes or interfaces without which loading or linking a class may result in exceptions, as described in JVMS chapter 5. This doesn't include classes obtained via Class.forName or other reflective means.)
Consider the following class.
public class Main {
    public static void main(String[] args) {
        identity(null);
    }
    public static Object identity(Foo x) {
        return x;
    }
}
javap -p -v Main.class prints:
Classfile /C:/Users/jbosboom/Documents/stackoverflow/build/classes/Main.class
Last modified Jul 2, 2014; size 346 bytes
MD5 checksum 2237cda2a15a58382b0fb98d6afacc7e
Compiled from "Main.java"
public class Main
SourceFile: "Main.java"
minor version: 0
major version: 52
flags: ACC_PUBLIC, ACC_SUPER
Constant pool:
#1 = Methodref #3.#17 // java/lang/Object."<init>":()V
#2 = Class #18 // Main
#3 = Class #19 // java/lang/Object
#4 = Utf8 <init>
#5 = Utf8 ()V
#6 = Utf8 Code
#7 = Utf8 LineNumberTable
#8 = Utf8 LocalVariableTable
#9 = Utf8 this
#10 = Utf8 LMain;
#11 = Utf8 identity
#12 = Utf8 (LFoo;)Ljava/lang/Object;
#13 = Utf8 x
#14 = Utf8 LAAA;
#15 = Utf8 SourceFile
#16 = Utf8 Main.java
#17 = NameAndType #4:#5 // "<init>":()V
#18 = Utf8 Main
#19 = Utf8 java/lang/Object
#20 = Utf8 java/lang/Thread
#21 = Class #20 // java/lang/Thread
#21 = Utf8 (LBar;)LFakename;
{
public Main();
descriptor: ()V
flags: ACC_PUBLIC
Code:
stack=1, locals=1, args_size=1
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
LineNumberTable:
line 6: 0
LocalVariableTable:
Start Length Slot Name Signature
0 5 0 this LMain;
public static java.lang.Object identity(Foo);
descriptor: (LFoo;)Ljava/lang/Object;
flags: ACC_PUBLIC, ACC_STATIC
Code:
stack=1, locals=1, args_size=1
0: aload_0
1: areturn
LineNumberTable:
line 11: 0
LocalVariableTable:
Start Length Slot Name Signature
0 2 0 x LAAA;
}
The class Foo, referenced as a parameter to the method identity, does not appear in the constant pool as a CONSTANT_Class_info entry. It does appear in the method descriptor for identity (entry #12). Field descriptors may also reference classes not appearing as CONSTANT_Class_info entries. Thus to find all the dependencies from the constant pool alone, you need to look at all UTF8 entries.
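To illustrate what "looking at" a descriptor entry involves, here is a minimal sketch that pulls the internal class names out of a field or method descriptor string. The class DescriptorClasses and its method classNames are hypothetical names made up for this example; they are not part of BCEL or the JDK. It assumes the string really is a well-formed descriptor and does no validation beyond stopping at an unterminated class name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a BCEL or JDK API.
final class DescriptorClasses {
    /**
     * Returns the internal class names (e.g. "java/lang/Object") that occur
     * in a field or method descriptor such as "(LFoo;)Ljava/lang/Object;".
     */
    static List<String> classNames(String descriptor) {
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < descriptor.length()) {
            if (descriptor.charAt(i) == 'L') {
                // An object type is written as L<internal name>;
                int end = descriptor.indexOf(';', i);
                if (end < 0) {
                    break; // unterminated class name: not a valid descriptor
                }
                names.add(descriptor.substring(i + 1, end));
                i = end + 1;
            } else {
                // '(', ')', '[' or a primitive type code such as I, J or V
                i++;
            }
        }
        return names;
    }
}
```

Applied to entry #12 above, classNames("(LFoo;)Ljava/lang/Object;") yields Foo and java/lang/Object. Running it over UTF8 entries that are not descriptors can produce false positives, which is the overapproximation this approach has to accept.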
The corner case: Some UTF8 entries may exist to be referenced by CONSTANT_String_info entries. Duplicate UTF8 entries will be merged, so one UTF8 entry might be a method descriptor, a string literal, or both. If you're only parsing the constant pool, you must live with this ambiguity (probably by overapproximating and treating it as a dependency).
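One way to at least see which UTF8 entries are affected by this ambiguity is to record, while reading the constant pool, which UTF8 indices are pointed at by CONSTANT_String_info entries. The following is a rough sketch using only the JDK's DataInputStream instead of BCEL; the class name ConstantPoolScan and its fields are invented for this example. It assumes the tag numbers and entry sizes from JVMS chapter 4 and reads only the constant pool, nothing after it.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not a BCEL or JDK API.
final class ConstantPoolScan {
    final Map<Integer, String> utf8 = new TreeMap<>(); // index -> text of each CONSTANT_Utf8
    final Set<Integer> stringRefs = new TreeSet<>();   // UTF8 indices used by a CONSTANT_String
    final Set<Integer> classRefs = new TreeSet<>();    // UTF8 indices used by a CONSTANT_Class

    static ConstantPoolScan read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        data.readUnsignedShort(); // minor_version
        data.readUnsignedShort(); // major_version
        int count = data.readUnsignedShort(); // valid indices are 1..count-1
        ConstantPoolScan scan = new ConstantPoolScan();
        for (int i = 1; i < count; i++) {
            int tag = data.readUnsignedByte();
            switch (tag) {
                case 1: // Utf8: u2 length + modified UTF-8 bytes
                    scan.utf8.put(i, data.readUTF());
                    break;
                case 7: // Class: u2 name_index
                    scan.classRefs.add(data.readUnsignedShort());
                    break;
                case 8: // String: u2 string_index
                    scan.stringRefs.add(data.readUnsignedShort());
                    break;
                case 5: case 6: // Long, Double: 8 bytes, occupy two slots
                    data.readFully(new byte[8]);
                    i++;
                    break;
                case 15: // MethodHandle: u1 + u2
                    data.readFully(new byte[3]);
                    break;
                case 16: case 19: case 20: // MethodType, Module, Package: u2
                    data.readFully(new byte[2]);
                    break;
                case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                    // Integer, Float, the *ref entries, NameAndType, Dynamic, InvokeDynamic: 4 bytes
                    data.readFully(new byte[4]);
                    break;
                default:
                    throw new IOException("unknown constant pool tag " + tag);
            }
        }
        return scan;
    }
}
```

A UTF8 index that shows up in stringRefs may be a string literal, a descriptor, or both; the constant pool alone cannot tell you which, so a conservative tool would still treat any descriptor-shaped text there as a dependency.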
If you trust the input to have been produced by a well-behaved Java compiler under your control, you can parse all UTF8 entries, mindful of the string corner case, and stop reading here. If you need to defend against an attacker feeding your tool handcrafted class files (e.g., you're writing a decompiler and the attacker wants to prevent decompilation), you need to parse the entire class file. Here are a few examples of the potential problems.
Consider the CONSTANT_Class_info entry for java/lang/Thread (#21) shown in the constant pool of Main. The JVM may or may not try to resolve this reference (JVMS 5.4 permits both lazy and eager loading). As the class exists, no error will be raised either way, so this extra entry is harmless, but it will fool tools looking at the constant pool into thinking Thread is a dependency.
That's just what I came up with off the top of my head. A clever attacker going through the JVMS with a fine-tooth comb could probably find more places to add entries to the constant pool that look used but aren't. If you need precise information even in the face of an attacker, you need to parse the whole class file and understand how a JVM will use it.