Search code examples
compilationx86virtualboxvirtualizationx86-emulation

How does dynamic recompilation handle instruction-pointer checks in software virtualization?


(This question isn't intended to be specific to VirtualBox or x86 per se, but since they're the best examples I'm aware of, I'll be referencing them and asking how VBox handles some scenarios. If you're aware of other solutions that aren't used by VBox, do consider mentioning them as well.)


I've read how VirtualBox does software virtualization, and I don't understand the following.

Before executing ring 0 code, CSAM [Code Scanning and Analysis Manager] scans it recursively to discover problematic instructions. PATM [Patch Manager] then performs in-situ patching, i.e. it replaces the instruction with a jump to hypervisor memory where an integrated code generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly. So, with its current complexity, one could argue that PATM is an advanced in-situ recompiler.

Consider the following sample instruction sequence in ring-0 code:

    call foo

foo:
    mov EAX, 1234
    mov EDX, [ESP]
    cmp EDX, EAX
    jne bar
    call do_something_special_if_return_address_was_1234
bar:
    ...

The callee here is testing to see if the return address to the caller is 1234, and if it is, it does something special. Clearly patching will change the return address, so we need to be able to handle it.

VirtualBox's documentation says it discovers "problematic" instructions and patches them in-place, but I don't really understand how this can work, for 2 reasons:

  1. It appears that any instruction that exposes the instruction pointer is "problematic", of which call is probably the most common (and extremely so). Does this mean VirtualBox must analyze, and potentially patch, every call instruction it sees in ring 0? Doesn't this make performance drop off a cliff? How do they handle this with high performance? (The cases they mention in their documentation are quite obscure, so I'm confused why they didn't mention such a common instruction if it occurs. And if it isn't a problem, I don't understand why.)

  2. If the instruction stream happens to be modified (e.g. dynamically loading/unloading kernel modules), VirtualBox must dynamically detect this and garbage-collect unreachable recompiled instructions. Otherwise, it will have a memory leak. But this means every mov instruction (and push instruction, and everything else that writes to memory) will now have to be analyzed and potentially patched, possibly repeatedly, because it may be modifying code that was patched. This would seem to essentially degenerate all guest ring-0 code into near-full software emulation (since the target of a move is not known during recompilation), which would make the virtualization cost skyrocket, and yet this is not the impression I get from reading the documentation. Is this not a problem? How is this handled efficiently?

Please note that I am not asking about hardware-assisted virtualization like Intel VT or AMD-V and I am not interested in reading about those. I'm well aware that they avoid these problems altogether, but my question is about pure software virtualization.


Solution

  • At least for QEMU, it seems like the answer is that even in the translated code, there is a separate emulated "stack" that is set up with the same values as the code would have when running natively, and this "stack" is the one read by the emulated code, which sees the same values as if it were running natively.

    This means that emulated code can't be translated to use call, ret, or any other stack-using instructions directly, since those would not use the emulated stack. Therefore, those calls are replaced by jumps to various bits of thunk code that does the right thing in terms of calling the equivalent translated code.

    Details for QEMU

    The OP's (reasonable) assumption seems to be that call and ret instructions would appear in the translated binary, and the stack would reflect the addresses of the dynamically translated code. What actually happens (in QEMU) is that the call and ret instructions are removed and replaced with control flow that doesn't use the stack, and values on the stack are set to have the same values as they would in native code.

    That is, the OP's mental model is that the result of the code translation is somewhat like the native code, with some patches and modifications. At least in the case of QEMU, that's not really the case—every basic block is heavily translated via the Tiny Code Generator (TCG), first to an intermediate representation and then to the target architecture (even if the source and destination archs are the same, as in my case). This deck has a good overview of a lot of the technical details, including an overview of TCG as shown below.

    Diagram of TCG Flow

    The resulting code is usually nothing like the input code, and typically increases in size by about 3 times. Registers are often used fairly sparingly and you often see back-to-back redundant sequences. Particularly relevant to this question, is that essentially all control flow instructions are quite differently, so ret and call instructions in the native code pretty much never translate into plain call or ret in the translated code.

    An example: First, some C code with a return_address() call that simply returns the return address, and a main() that prints this function:

    #include <stdlib.h>
    #include <stdio.h>
    
    __attribute__ ((noinline)) void* return_address() {
        // stuff here?
        return __builtin_return_address(0);
    }
    
    
    int main(int argc, char **argv) {
        void *a = return_address();
        printf("%p\n", a);
    }
    

    The noinline here is important since otherwise gcc just may just inline the function and hardcode the address directly into the assembly without making a call or accessing the stack at all!

    With gcc -g -O1 -march=native this compiles to:

    0000000000400546 <return_address>:
      400546:       48 8b 04 24             mov    rax,QWORD PTR [rsp]
      40054a:       c3                      ret    
    
    000000000040054b <main>:
      40054b:       48 83 ec 08             sub    rsp,0x8
      40054f:       b8 00 00 00 00          mov    eax,0x0
      400554:       e8 ed ff ff ff          call   400546 <return_address>
      400559:       48 89 c2                mov    rdx,rax
      40055c:       be 04 06 40 00          mov    esi,0x400604
      400561:       bf 01 00 00 00          mov    edi,0x1
      400566:       b8 00 00 00 00          mov    eax,0x0
      40056b:       e8 c0 fe ff ff          call   400430 <__printf_chk@plt>
      400570:       b8 00 00 00 00          mov    eax,0x0
      400575:       48 83 c4 08             add    rsp,0x8
      400579:       c3                      ret    
    

    Note that return_address() returns [rsp] just like the OP's example. The main() function sticks it in rdx, where printf will read it from.

    We expect the caller's return address as seen by return_address to be the instruction following the call, 0x400559:

      400554:       e8 ed ff ff ff          call   400546 <return_address>
      400559:       48 89 c2                mov    rdx,rax
    

    ... and indeed that's what we see when we run it natively:

    person@host:~/dev/test-c$ ./qemu-test 
    0x400559
    

    Lets try it in QEMU:

    person@host:~/dev/test-c$ qemu-x86_64 ./qemu-test
    0x400559
    

    It works! Note that QEMU is by default translating all code and putting it far away from the usual native location (as we will see shortly), so we don't need any special instructions to trigger translation.

    How does that work under the covers? We can use the -d in_asm,out_asm option with QEMU to see what it is making of this code.

    First, the code leading up to the call (the IN portion is the native code, and the OUT is what QEMU translates it to—sorry for AT&T syntax, I can't figure out how to change that in QEMU):

    IN: main
    0x000000000040054b:  sub    $0x8,%rsp
    0x000000000040054f:  mov    $0x0,%eax
    0x0000000000400554:  callq  0x400546
    
    OUT: [size=123]
    0x557c9cf33a40:  mov    -0x8(%r14),%ebp
    0x557c9cf33a44:  test   %ebp,%ebp
    0x557c9cf33a46:  jne    0x557c9cf33aac
    0x557c9cf33a4c:  mov    0x20(%r14),%rbp
    0x557c9cf33a50:  sub    $0x8,%rbp
    0x557c9cf33a54:  mov    %rbp,0x20(%r14)
    0x557c9cf33a58:  mov    $0x8,%ebx
    0x557c9cf33a5d:  mov    %rbx,0x98(%r14)
    0x557c9cf33a64:  mov    %rbp,0x90(%r14)
    0x557c9cf33a6b:  xor    %ebx,%ebx
    0x557c9cf33a6d:  mov    %rbx,(%r14)
    0x557c9cf33a70:  sub    $0x8,%rbp
    0x557c9cf33a74:  mov    $0x400559,%ebx
    0x557c9cf33a79:  mov    %rbx,0x0(%rbp)
    0x557c9cf33a7d:  mov    %rbp,0x20(%r14)
    0x557c9cf33a81:  mov    $0x11,%ebp
    0x557c9cf33a86:  mov    %ebp,0xa8(%r14)
    0x557c9cf33a8d:  jmpq   0x557c9cf33a92
    0x557c9cf33a92:  movq   $0x400546,0x80(%r14)
    0x557c9cf33a9d:  mov    $0x7f177ad8a690,%rax
    0x557c9cf33aa7:  jmpq   0x557c9cef8196
    0x557c9cf33aac:  mov    $0x7f177ad8a693,%rax
    0x557c9cf33ab6:  jmpq   0x557c9cef8196
    

    A key part is here:

    0x557c9cf33a74:  mov    $0x400559,%ebx
    0x557c9cf33a79:  mov    %rbx,0x0(%rbp)
    

    You can see that it is actually manually putting the return address from the native code onto the "stack" (which seems generally to be accessed with rbp). Following that, note that there is no call instruction to return_address. Rather, we have:

    0x557c9cf33a92:  movq   $0x400546,0x80(%r14)
    0x557c9cf33a9d:  mov    $0x7f177ad8a690,%rax
    0x557c9cf33aa7:  jmpq   0x557c9cef8196
    

    In most of the code, r14 seems to be a pointer to some internal QEMU data structure (i.e., not used to hold values from the emulated program). The above shoves 0x400546 (which is the address of the return_address function in native code) into a field of the structure pointed to by r14, sticks 0x7f177ad8a690 in rax, and jumps to 0x557c9cef8196. This last address appears all over the place in the generated code (but its definition does not) and seems to be some kind of internal dispatch or thunk method. Presumably, it uses either the native address or, more likely, the mystery value shoved in rax to dispatch to the translated return_address method, which looks like this:

    ----------------
    IN: return_address
    0x0000000000400546:  mov    (%rsp),%rax
    0x000000000040054a:  retq   
    
    OUT: [size=64]
    0x55c131ef9ad0:  mov    -0x8(%r14),%ebp
    0x55c131ef9ad4:  test   %ebp,%ebp
    0x55c131ef9ad6:  jne    0x55c131ef9b01
    0x55c131ef9adc:  mov    0x20(%r14),%rbp
    0x55c131ef9ae0:  mov    0x0(%rbp),%rbx
    0x55c131ef9ae4:  mov    %rbx,(%r14)
    0x55c131ef9ae7:  mov    0x0(%rbp),%rbx
    0x55c131ef9aeb:  add    $0x8,%rbp
    0x55c131ef9aef:  mov    %rbp,0x20(%r14)
    0x55c131ef9af3:  mov    %rbx,0x80(%r14)
    0x55c131ef9afa:  xor    %eax,%eax
    0x55c131ef9afc:  jmpq   0x55c131ebe196
    0x55c131ef9b01:  mov    $0x7f9ba51f7713,%rax
    0x55c131ef9b0b:  jmpq   0x55c131ebe196
    

    The first bit of code seems to set up the user "stack" in ebp (getting it from r14 + 0x20, which is probably the emulated machine state structure) and ultimately reads from the "stack" (the line mov 0x0(%rbp),%rbx) and stores that away into the area pointed to by r14 (mov %rbx,0x80(%r14)).

    Finally, it gets to jmpq 0x55c131ebe196, which transfers to the QEMU epilogue routine:

    0x55c131ebe196:  add    $0x488,%rsp
    0x55c131ebe19d:  pop    %r15
    0x55c131ebe19f:  pop    %r14
    0x55c131ebe1a1:  pop    %r13
    0x55c131ebe1a3:  pop    %r12
    0x55c131ebe1a5:  pop    %rbx
    0x55c131ebe1a6:  pop    %rbp
    0x55c131ebe1a7:  retq   
    

    Note that I use the word "stack" in quotes above. This is because this "stack" is the emulation of the stack as seen by the emulated program, not the true stack pointed to by rsp. The true stack pointed to by rsp is controlled by QEMU to implement the emulated control flow, and the emulated code doesn't access it directly.

    Some Things Can Change

    We see above that "stack" contents as seen by the emulated process are the same under QEMU, but details of the stack do change. For example, the address of the stack looks different under emulation than natively (i.e., the value of rsp rather than the stuff pointed to by [rsp]).

    This function:

    __attribute__ ((noinline)) void* return_address() {
        return __builtin_frame_address(0);
    }
    

    normally returns addresses like 0x7fffad33c100 but returns addresses like 0x40007ffd00 under QEMU. This should not be a problem, though, because no valid program should depend on the exact absolute value of the stack address. Not only is it generally not defined and unpredictable, but on recent operating systems it really is designed to be totally unpredictable due to stack ASLR (Linux and Windows both implement this). The program above returns a different address each time I run it natively (but the same address under QEMU).

    Self Modifying Code

    You also mentioned a concern about when the instruction stream is modified, and gave an example of loading a kernel module. First, at least for QEMU, code is only translated "on demand". Functions that could be called, but aren't in some particular run, are never translated (you can try it with a function that is called conditionally depending on argc). So in general, loading new code into the kernel, or into a process in user-mode emulation, is handled by the same mechanism: the code will simply be translated the first time it is called.

    If the code is actually self modifying—i.e., the process writes to its own code—then something has to be done, since without help QEMU would keep using the old translation. So, to detect self-modifying code without penalizing every write to memory, the native code lives in pages with R+X permissions only. The consequence is that a write raises a GP fault which QEMU handles by noting that the code has modified itself, invalidating the translation, and so on. Plenty of details can be found in this thread and elsewhere.

    It's a reasonable mechanism, and I expect other code-translation VMs do something similar.

    Note that in the case of self-modifying code, the "garbage collection" problem is simple: the emulator is informed about the SMC event as described above, and since it has to re-translate at this point, it throws away the old translation.