I have a 16-bit register-based virtual machine. What are the steps for compiling its bytecode to actual x86 machine code? I'm not looking to make a JIT compiler unless that is necessary to be able to link the compiled code with another executable / DLL.
The VM is made such that if it is added to a project, special language constructs can be added. (For example, if it is embedded into a game engine, an "Entity" object type may be added, and several C functions from the engine might be exposed.) This will cause the script code to be completely dependent on certain exposed C functions or exposed C++ classes in the application it's embedded into.
How would this sort of "linking" be possible if the script code is compiled from VM bytecode into a native EXE?
The VM is also register-based, like Lua's VM: all basic variables are stored in "registers", which are slots in one huge C array. A register pointer is incremented or decremented when the scope changes, so register numbers are relative, similar to offsets from a stack pointer. E.g.:
int a = 5;
{
    int a = 1;
}
might be, in virtual machine pseudo-assembly:
mov_int (%r0, $5)
; new scope, the "register pointer" is then incremented by the number
; of bytes that are used to store local variables in this new scope. E.g. int = 4 bytes
; say $rp is the "register pointer"
add (%rp, $4) ; since size of int is usually 4 bytes
; this assumes registers are 1 byte in size; if they were
; 4 bytes in size it would just be adding $1
mov_int (%r0, $1) ; now each register "index" is offset by 4,
; this is now technically setting %r4
; different instructions are used to get values above current scope
sub (%rp, $4) ; end of scope so reset %rp
My question about this part is, would I have to use the stack pointer for this sort of thing? The base pointer? What could I use to replace this concept?
The VM is made such that if it is added to a project, special language constructs can be added. (For example, if it is embedded into a game engine, an "Entity" object type may be added, and several C functions from the engine might be exposed.) This will cause the script code to be completely dependent on certain exposed C functions or exposed C++ classes in the application it's embedded into.
There are many ways to implement this kind of cross-language interfacing. Whether you're running VM bytecode or native machine code isn't going to matter much here unless you need an interface with very low overhead. The main consideration is the nature of your language, especially whether it has static or dynamic typing.
Generally speaking these are the two most common approaches (you may already be familiar with them):
(a) The 'foreign-function-interface' approach, where your language/runtime offers facilities to automatically wrap functions and data from C. Examples include LuaJIT FFI, js-ctypes and P/Invoke. Most FFIs can operate on CDECL/STDCALL functions and POD structures; some have varying levels of support for C++ or COM classes.
(b) The 'runtime-API' approach, where your runtime exposes a C API you can use to manually construct/manipulate objects for use in your language. Lua has an extensive API for this, as does Python.
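To make approach (b) a bit more concrete, here is a minimal sketch of what such a runtime API could look like. Every name in it (`VM`, `vm_register`, `vm_call`, `host_add`) is made up for illustration and is not taken from Lua, Python, or your VM; it also assumes all values are plain ints, which a real runtime would not.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical runtime API; none of these names come from a real VM. */
enum { VM_MAX_FUNCS = 16, VM_STACK_SIZE = 16 };

typedef struct VM VM;
typedef int (*HostFn)(VM *vm);      /* returns the number of results pushed */

struct VM {
    int stack[VM_STACK_SIZE];       /* argument/result stack shared with the host */
    int top;
    const char *names[VM_MAX_FUNCS];
    HostFn funcs[VM_MAX_FUNCS];
    int nfuncs;
};

void vm_init(VM *vm) { vm->top = 0; vm->nfuncs = 0; }

void vm_push(VM *vm, int v) {
    assert(vm->top < VM_STACK_SIZE);
    vm->stack[vm->top++] = v;
}

int vm_pop(VM *vm) {
    assert(vm->top > 0);
    return vm->stack[--vm->top];
}

/* Host side: expose a C function to scripts under a name. */
void vm_register(VM *vm, const char *name, HostFn fn) {
    assert(vm->nfuncs < VM_MAX_FUNCS);
    vm->names[vm->nfuncs] = name;
    vm->funcs[vm->nfuncs] = fn;
    vm->nfuncs++;
}

/* Script side: call a host function by name; returns -1 if it is unknown. */
int vm_call(VM *vm, const char *name) {
    for (int i = 0; i < vm->nfuncs; i++)
        if (strcmp(vm->names[i], name) == 0)
            return vm->funcs[i](vm);
    return -1;
}

/* Example engine function wrapped for the VM: pops two ints, pushes their sum. */
int host_add(VM *vm) {
    int b = vm_pop(vm);
    int a = vm_pop(vm);
    vm_push(vm, a + b);
    return 1;
}
```

The host would call `vm_register(&vm, "add", host_add)` once at startup; script code then pushes its arguments and goes through `vm_call`. The same table works whether the caller is a bytecode interpreter or compiled native code, which is why the choice between the two matters less than it might seem.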
How would this sort of "linking" be possible if the script code is compiled from VM bytecode into a native EXE?
So you're probably thinking about how to e.g. bake foreign function addresses into your generated machinecode. Well, if you have the proper FFI infrastructure in place there's no reason you can't do this, as long as you're aware of how shared-library imports work (import address tables, relocation, fixups, etc.).
If you don't know much about shared libraries, I think spending some time researching that area will give you a much clearer idea of the ways you can implement FFI in your compiler.
However, it would probably be easier to take a slightly more dynamic approach, e.g. call LoadLibrary() and GetProcAddress(), then wrap the function pointer as an object of your language.
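As a rough illustration of the idea, the sketch below models an import table in plain C: the compiler records the names of the foreign functions a script uses, a loader step fills in their addresses, and the generated code calls indirectly through the slots. All names here (`ImportSlot`, `bind_imports`, `engine_double`, and so on) are hypothetical, every import is assumed to have the signature `int(int)`, and the `resolve` function is only a stand-in for a real lookup such as GetProcAddress() on a module obtained from LoadLibrary().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*ImportFn)(int);

/* One entry per foreign function the compiled script refers to. */
typedef struct {
    const char *name;   /* symbol name recorded by the compiler */
    ImportFn addr;      /* filled in at load time */
} ImportSlot;

/* Stand-in for the host application's exported functions. */
typedef struct {
    const char *name;
    ImportFn addr;
} HostExport;

int engine_double(int x) { return 2 * x; }
int engine_negate(int x) { return -x; }

static const HostExport host_exports[] = {
    { "engine_double", engine_double },
    { "engine_negate", engine_negate },
};

/* Look a symbol up by name; a real loader might query the OS here instead. */
static ImportFn resolve(const char *name) {
    for (size_t i = 0; i < sizeof host_exports / sizeof host_exports[0]; i++)
        if (strcmp(host_exports[i].name, name) == 0)
            return host_exports[i].addr;
    return NULL;
}

/* Loader step: patch every slot; returns the number of unresolved imports. */
int bind_imports(ImportSlot *slots, size_t n) {
    int missing = 0;
    for (size_t i = 0; i < n; i++) {
        slots[i].addr = resolve(slots[i].name);
        if (slots[i].addr == NULL)
            missing++;
    }
    return missing;
}

/* What the generated code does: an indirect call through slot `index`. */
int call_import(const ImportSlot *slots, size_t index, int arg) {
    assert(slots[index].addr != NULL);
    return slots[index].addr(arg);
}
```

The point of the indirection is that the generated code never contains a hard-coded foreign address, only a slot index, so the same compiled script can be bound to whatever host it ends up running inside.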
It's unfortunately very hard to give more specific suggestions without knowing anything about the language/VM in question.
[…] My question about this part is, would I have to use the stack pointer for this sort of thing? The base pointer? What could I use to replace this concept?
I'm not entirely sure what the purpose of this 'register array' scheme is.
In a language with lexical scoping, it's my understanding that when compiling a function you typically enumerate every variable declared in its body and allocate a block of stack space large enough to hold all the variables (ignoring the complex topic of CPU register allocation). The code can address these variables using the stack pointer or (more often) the base pointer.
If a variable in an inner scope shadows a variable in an outer scope, as in your example, they're assigned separate memory locations on the stack, because as far as the compiler is concerned they are different variables.
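Here is a small sketch of that bookkeeping, assuming the simplest possible policy: every declaration gets a fresh slot and slots are never reused after a scope ends. The `Frame` type and the function names are invented for this example and do not describe how any particular compiler does it.

```c
#include <assert.h>
#include <string.h>

enum { MAX_VARS = 32 };

typedef struct {
    const char *name;
    int depth;     /* scope nesting level of the declaration */
    int offset;    /* the variable lives at [ebp - offset] */
} Var;

typedef struct {
    Var vars[MAX_VARS];
    int count;       /* declarations currently visible */
    int depth;       /* current scope nesting level */
    int frame_size;  /* total bytes of stack the function needs so far */
} Frame;

void frame_init(Frame *f) { f->count = 0; f->depth = 0; f->frame_size = 0; }

void scope_enter(Frame *f) { f->depth++; }

/* Leaving a scope hides its names but keeps their stack space reserved. */
void scope_leave(Frame *f) {
    assert(f->depth > 0);
    while (f->count > 0 && f->vars[f->count - 1].depth == f->depth)
        f->count--;
    f->depth--;
}

/* Give a new variable its own slot; returns its offset below ebp. */
int declare(Frame *f, const char *name, int size) {
    assert(f->count < MAX_VARS);
    f->frame_size += size;
    f->vars[f->count] = (Var){ name, f->depth, f->frame_size };
    f->count++;
    return f->frame_size;
}

/* Resolve a name to the innermost visible declaration, or -1 if unknown. */
int lookup(const Frame *f, const char *name) {
    for (int i = f->count - 1; i >= 0; i--)
        if (strcmp(f->vars[i].name, name) == 0)
            return f->vars[i].offset;
    return -1;
}
```

For your example, the outer `a` is declared at offset 4 and the inner `a` at offset 8. Inside the braces `lookup` returns 8, after `scope_leave` it returns 4 again, and `frame_size` ends up as 8, which is the amount the function prologue would subtract from the stack pointer.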
Without understanding the rationale behind whatever scheme the VM is using I can't really suggest how it should translate to machinecode. Maybe someone with more experience programming bytecode compilers can give you the answer you're after.
However, it may be that your VM's approach is actually similar to what I've described, in which case adapting it for machine code compilation should be very straightforward: just a matter of translating your virtual local-variable memory space into stack space.
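If that is the case, the translation could look roughly like the sketch below. It rests on several assumptions that may not hold for your VM: registers are byte-addressed as in your example, every `add`/`sub` on the register pointer uses a constant, the code is straight-line so the pointer's value is known at compile time for each instruction, and ints are 4 bytes. The `Insn` encoding and the `translate` function are invented for illustration, and the output is NASM-style text rather than encoded machine code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { OP_MOV_INT, OP_ADD_RP, OP_SUB_RP } Op;

/* Hypothetical decoded VM instruction; `reg` is unused for the RP ops. */
typedef struct {
    Op op;
    int reg;   /* register index relative to the register pointer */
    int imm;   /* immediate value, or the amount to move RP by */
} Insn;

/* Translate VM code to x86 text in `out`; returns the frame size in bytes. */
int translate(const Insn *code, int n, char *out, size_t cap) {
    int rp = 0;      /* register pointer, tracked at compile time */
    int frame = 0;   /* deepest byte of the register array touched */
    assert(cap > 0);
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        switch (code[i].op) {
        case OP_ADD_RP:
            rp += code[i].imm;   /* no machine code emitted: RP vanishes */
            break;
        case OP_SUB_RP:
            rp -= code[i].imm;
            assert(rp >= 0);
            break;
        case OP_MOV_INT: {
            int addr = rp + code[i].reg;   /* absolute index in the VM array */
            int off = addr + 4;            /* a 4-byte int ends at ebp - addr */
            size_t used = strlen(out);
            int w = snprintf(out + used, cap - used,
                             "mov dword [ebp-%d], %d\n", off, code[i].imm);
            assert(w > 0 && (size_t)w < cap - used);
            if (off > frame)
                frame = off;
            break;
        }
        }
    }
    return frame;
}
```

Fed the four instructions from your pseudo-assembly, this emits `mov dword [ebp-4], 5` followed by `mov dword [ebp-8], 1` and reports a frame size of 8. The register pointer itself never appears in the output, because under these assumptions it is folded into the constant offsets from the base pointer.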