java performance static final initializer

Java super-tuning, a few questions

Before I ask my question can I please ask not to get a lecture about optimising for no reason. Consider the following questions purely academic.

I've been thinking about the efficiency of accesses between root (ie often used and often accessing each other) classes in Java, but this applies to most OO languages/compilers. The fastest way (I'm guessing) that you could access something in Java would be a static final reference. Theoretically, since that reference is available during loading, a good JIT compiler would remove the need to do any reference lookup to access the variable and point any accesses to that variable straight to a constant address. Perhaps for security reasons it doesn't work that way anyway, but bear with me...

Say I've decided that there are some order of operations problems or some arguments to pass at startup that means I can't have a static final reference, even if I were to go to the trouble of having each class construct the other as is recommended to get Java classes to have static final references to each other. Another reason I might not want to do this would be... oh, say, just for example, that I was providing platform specific implementations of some of these classes. ;-)

Now I'm left with two obvious choices. I can have my classes know about each other with a static reference (on some system hub class), which is set after constructing all classes (during which I mandate that they cannot access each other yet, thus doing away with order of operations problems at least during construction). On the other hand, the classes could have instance final references to each other, were I now to decide that sorting out the order of operations was important or could be made the responsibility of the person passing the args - or more to the point, providing platform specific implementations of these classes we want to have referencing each other.

A static variable means you don't have to look up the location of the variable wrt to the class it belongs to, saving you one operation. A final variable means you don't have to look up the value at all but it does have to belong to your class, so you save 'one operation'. OK I know I'm really handwaving now!

Then something else occurred to me: I could have static final stub classes, kind of like a wacky interface where each call was relegated to an 'impl' which can just extend the stub. The performance hit then would be the double function call required to run the functions and possibly I guess you can't declare your methods final anymore. I hypothesised that perhaps those could be inlined if they were appropriately declared, then gave up as I realised I would then have to think about whether or not the references to the 'impl's could be made static, or final, or...

So which of the three would turn out fastest? :-)

Any other thoughts on lowering frequent-access overheads or even other ways of hinting performance to the JIT compiler?

UPDATE: After running several hours of test of various things and reading http://www.ibm.com/developerworks/java/library/j-jtp02225.html I've found that most things you would normally look at when tuning e.g. C++ go out the window completely with the JIT compiler. I've seen it run 30 seconds of calculations once, twice, and on the third (and subsequent) runs decide "Hey, you aren't reading the result of that calculation, so I'm not running it!".

FWIW you can test data structures and I was able to develop an arraylist implementation that was more performant for my needs using a microbenchmark. The access patterns must have been random enough to keep the compiler guessing, but it still worked out how to better implement a generic-ified growing array with my simpler and more tuned code.

As far as the test here was concerned, I simply could not get a benchmark result! My simple test of calling a function and reading a variable from a final vs non-final object reference revealed more about the JIT than the JVM's access patterns. Unbelievably, calling the same function on the same object at different places in the method changes the time taken by a factor of FOUR!

As the guy in the IBM article says, the only way to test an optimisation is in-situ.

Thanks to everyone who pointed me along the way.

Solution

See the update, I answered my own question by doing some benchmarking, and found that there are far greater gains in unexpected areas and that performance for simple operations like referencing members is comparable on most modern systems where performance is limited more by memory bandwidth than CPU cycles.