Tags: c++, assembly, jvm, jit, vm-implementation

Just-in-Time compilation of Java bytecode


We are currently working on the JIT compilation part of our own Java Virtual Machine implementation. Our idea is to do a simple translation of the given Java bytecode into x86 opcodes, write them to executable memory and CALL right to the start of the method.
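
A minimal sketch of that last step - POSIX mmap is assumed here, and a real implementation would respect W^X by mapping the page writable first and flipping it to executable with mprotect - could look like this:

#include <cstdint>
#include <cstring>
#include <sys/mman.h>

int main() {
    // x86 machine code for:  mov eax, 13372338 ; ret
    const uint8_t code[] = { 0xB8, 0xB2, 0x0B, 0xCC, 0x00, 0xC3 };

    // Allocate one writable + executable page and copy the code into it.
    void* mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    std::memcpy(mem, code, sizeof(code));

    // CALL right to the start of the generated method.
    auto jitted = reinterpret_cast<int (*)()>(mem);
    int result = jitted();   // 13372338

    munmap(mem, 4096);
    return result == 13372338 ? 0 : 2;
}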

Assuming the given Java code would be:

int a = 13372338;
int b = 32 * a;
return b;

Now, we came up with the following approach (assuming the allocated memory starts at 0x1000 and the return value is expected in eax):

0x1000: first local variable  - accessible via [edx - 8]
0x1004: second local variable - accessible via [edx - 4]
0x1008: start of the code     - its address is kept in edx

Java bytecode | Assembler code (NASM syntax)
--------------|------------------------------------------------------------------
              | // start (x86 cannot read eip directly; the JIT
              | // knows where it placed the code, so it loads
              | // that address as an immediate)
              | mov edx, 0x1008
              | push ebx
              |         
              | // method content
ldc           | mov eax, 13372338
              | push eax
istore_0      | pop eax
              | mov [edx - 8], eax
bipush        | push 32
iload_0       | mov eax, [edx - 8]
              | push eax
imul          | pop ebx
              | pop eax
              | imul eax, ebx   // signed; unlike mul, leaves edx intact
              | push eax
istore_1      | pop eax
              | mov [edx - 4], eax
iload_1       | mov eax, [edx - 4]
              | push eax
ireturn       | pop eax
              |         
              | // end
              | pop ebx
              | ret
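
The translator behind such a table is essentially a switch over opcodes that appends a fixed machine-code template per bytecode. A rough sketch (the Emitter class is made up; the byte values are the standard x86 encodings of the instructions named in the comments):

#include <cstdint>
#include <initializer_list>
#include <vector>

struct Emitter {
    std::vector<uint8_t> out;

    void bytes(std::initializer_list<uint8_t> b) { out.insert(out.end(), b); }
    void imm32(int32_t v) {
        for (int i = 0; i < 4; ++i) out.push_back(uint8_t(v >> (8 * i)));
    }

    void push_const(int32_t v) {                    // ldc / bipush
        bytes({0xB8}); imm32(v);                    // mov eax, imm32
        bytes({0x50});                              // push eax
    }
    void istore(int slot) {                         // istore_<n>
        bytes({0x58});                              // pop eax
        bytes({0x89, 0x42, uint8_t(4 * slot - 8)}); // mov [edx + disp8], eax
    }
    void iload(int slot) {                          // iload_<n>
        bytes({0x8B, 0x42, uint8_t(4 * slot - 8)}); // mov eax, [edx + disp8]
        bytes({0x50});                              // push eax
    }
    void imul() {                                   // imul
        bytes({0x5B, 0x58});                        // pop ebx ; pop eax
        bytes({0x0F, 0xAF, 0xC3});                  // imul eax, ebx
        bytes({0x50});                              // push eax
    }
    void ireturn() { bytes({0x58}); }               // pop eax
};

Emitting the example method is then push_const(13372338); istore(0); push_const(32); iload(0); imul(); istore(1); iload(1); ireturn(); wrapped in the prologue and epilogue from the table.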

This would simply use the machine stack just like the virtual machine itself uses its operand stack. The questions regarding this approach are:

  • Is this method of compilation viable?
  • Is it even possible to implement all the Java instructions this way? How could things like athrow/instanceof and similar instructions be translated?

Solution

  • This method of compilation works, is easy to get up and running, and it at least removes interpretation overhead. But it results in pretty large amounts of code and pretty awful performance. One big problem is that it transliterates the stack operations 1:1, even though the target machine (x86) is a register machine. As you can see in the snippet you posted (as well as in any other code translated this way), this always results in several stack-manipulation opcodes for every single operation, so it uses the registers - heck, the whole ISA - about as ineffectively as possible.

    You can also support complicated control flow such as exceptions; it's not very different from implementing it in an interpreter. But if you want good performance, you don't want to perform work every time you enter or exit a try block. There are schemes that avoid this, used by both C++ implementations and other JVMs (keyword: zero-cost or table-driven exception handling). These are quite complex to implement, understand and debug, so you should go with a simpler alternative first - just keep them in mind.
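
    For reference, the class file already stores handlers as a table: each method's Code attribute carries an exception_table of (start_pc, end_pc, handler_pc, catch_type) entries. A table-driven JIT keeps the analogous structure for native code ranges and consults it only when something is actually thrown. A sketch, with a stubbed-out type check:

    #include <cstdint>
    #include <vector>

    struct ExceptionTableEntry {
        uintptr_t start, end;   // native code range covered ([start, end))
        uintptr_t handler;      // native address of the catch block
        uint16_t  catch_type;   // constant-pool index of the caught class; 0 = any
    };

    // Stub: the real check walks the class hierarchy of the thrown object.
    bool instance_of(uint16_t thrown, uint16_t caught) { return thrown == caught; }

    // Called only on an actual throw, so the non-throwing fast path pays
    // nothing for entering or leaving the try block.
    uintptr_t find_handler(const std::vector<ExceptionTableEntry>& table,
                           uintptr_t throwing_pc, uint16_t thrown_type) {
        for (const auto& e : table)
            if (throwing_pc >= e.start && throwing_pc < e.end &&
                (e.catch_type == 0 || instance_of(thrown_type, e.catch_type)))
                return e.handler;  // transfer control to the catch block
        return 0;                  // no handler in this frame: unwind to caller
    }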

    As for the generated code: the first optimization, one you'll almost certainly need, is converting the stack operations into three-address code or some other representation that uses registers. There are several papers on this, and implementations too, so I won't elaborate unless you want me to. Then, of course, you need to map these virtual registers onto physical registers. Register allocation is one of the most well-researched topics in compiler construction, and there are at least half a dozen heuristics that are reasonably effective and fast enough to use in a JIT compiler. One example off the top of my head is linear scan register allocation (created specifically for JIT compilation).
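
    The conversion itself is a little abstract interpretation: the operand stack is simulated at compile time, so each stack slot is just the name of the virtual register that would hold it, and no push/pop instructions are emitted at all. A minimal sketch covering the bytecodes from the example:

    #include <vector>

    // Three-address code: dst = a <op> b, everything in virtual registers.
    struct Tac { char op; int dst, a, b; int imm; };

    struct Translator {
        std::vector<int> stack;          // which vreg holds each operand-stack slot
        std::vector<int> locals{-1, -1}; // which vreg currently holds each local
        std::vector<Tac> code;
        int next_vreg = 0;

        int fresh() { stack.push_back(next_vreg); return next_vreg++; }
        int pop()   { int v = stack.back(); stack.pop_back(); return v; }

        void ldc(int value) {                               // ldc / bipush
            code.push_back({'c', fresh(), 0, 0, value});    // vN = value
        }
        void istore(int slot) { locals[slot] = pop(); }     // bookkeeping only,
        void iload(int slot)  { stack.push_back(locals[slot]); } // emits nothing
        void imul() {
            int b = pop(), a = pop();
            code.push_back({'*', fresh(), a, b, 0});        // vN = a * b
        }
    };

    Fed the example method, this produces just v0 = 13372338; v1 = 32; v2 = v1 * v0 - the loads and stores vanish, and the three virtual registers are what linear scan would then map onto eax, ebx, ecx.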

    Beyond that, most JIT compilers focused on the performance of the generated code (as opposed to quick compilation) use one or more intermediate representations and optimize the program in that form. This is basically your run-of-the-mill compiler optimization suite, including veterans like constant propagation, value numbering, reassociation, loop-invariant code motion, etc. - these things are not only simple to understand and implement, they've also been described in thirty years of literature, up to and including textbooks and Wikipedia.
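
    For instance, constant propagation over the three-address form above is a single forward pass, and on the example method it already folds 32 * 13372338 down to the constant 427914816 at compile time. A minimal sketch (reusing the Tac struct from the earlier snippet):

    #include <unordered_map>
    #include <vector>

    struct Tac { char op; int dst, a, b; int imm; };  // as above

    void fold_constants(std::vector<Tac>& code) {
        std::unordered_map<int, int> known;           // vreg -> constant value
        for (auto& ins : code) {
            if (ins.op == 'c') { known[ins.dst] = ins.imm; continue; }
            if (ins.op == '*' && known.count(ins.a) && known.count(ins.b)) {
                // Both operands known: replace the multiply with a constant load.
                ins = {'c', ins.dst, 0, 0, known[ins.a] * known[ins.b]};
                known[ins.dst] = ins.imm;
            } else {
                known.erase(ins.dst);                 // result no longer constant
            }
        }
    }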

    The code you'll get from the above will be pretty good for straight-line code using primitives, arrays, and object fields. However, you won't be able to optimize method calls at all. Practically every method in Java is virtual, which means inlining or even moving method calls (for example, out of a loop) is basically impossible except in very special cases. You mentioned that this is for a kernel. If you can accept using a subset of Java without dynamic class loading, you can do better (though it'll be nonstandard) by assuming the JIT knows all classes. Then you can, for example, detect leaf classes (or, more generally, methods which are never overridden) and inline those.
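
    A sketch of that closed-world check, with a made-up Klass record standing in for whatever class metadata the VM keeps:

    #include <string>
    #include <unordered_set>
    #include <vector>

    struct Klass {
        const Klass* super = nullptr;
        std::unordered_set<std::string> declared_methods;
        std::vector<const Klass*> subclasses;
    };

    bool is_overridden(const Klass& k, const std::string& method) {
        for (const Klass* sub : k.subclasses)
            if (sub->declared_methods.count(method) || is_overridden(*sub, method))
                return true;
        return false;
    }

    // True means the JIT may emit a direct call (or inline the body) instead
    // of a vtable dispatch - valid only because no new class can ever load.
    bool can_devirtualize(const Klass& receiver, const std::string& method) {
        return !is_overridden(receiver, method);
    }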

    If you do need dynamic class loading but expect it to be rare, you can do better still, though it takes more work. The advantage is that this approach generalizes to other things, like eliminating logging statements completely. The basic idea is to specialize the code based on some assumptions (for example, that a given static field does not change, or that no new classes are loaded), then de-optimize if those assumptions are violated. This means you'll sometimes have to re-compile code while it is running (hard, but not impossible).
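
    A sketch of the guard mechanism, with stand-in stubs for the interpreter fallback and the specialized body; a production VM would typically invalidate the compiled code itself rather than check a flag, but the structure is the same:

    #include <atomic>

    // One guard per assumption, e.g. "no subclass of Foo has been loaded".
    struct Assumption {
        std::atomic<bool> still_valid{true};
    };

    int interpret_method() { return 0; }  // stub for the interpreter slow path
    int specialized_body() { return 1; }  // stub for assumption-based fast code

    int call_specialized(Assumption& a) {
        if (!a.still_valid.load(std::memory_order_acquire))
            return interpret_method();    // de-optimize: take the slow path
        return specialized_body();        // fast path compiled under the assumption
    }

    // Invoked by the class loader when a conflicting class is loaded.
    void invalidate(Assumption& a) {
        a.still_valid.store(false, std::memory_order_release);
    }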

    If you go further down this road, its logical conclusion is trace-based JIT compilation, which has been applied to Java, but AFAIK it didn't turn out to be superior to method-based JIT compilers. It's more effective when you have to make dozens or hundreds of assumptions to get good code, as happens with highly dynamic languages.