
How do a bare metal hypervisor and the operating system it hosts coordinate on system calls?


I have read a great deal about bare metal hypervisors, but I have never quite understood how they interact with the OS they are hosting.

Suppose you have Unix itself on bare metal. When in user mode, you can't touch or affect the OS internals. You get things done by a system call that gets trapped, switches the machine to kernel mode, and does the job for you. For example, in C, you might malloc() a bunch, then eventually run out of initially allocated memory. When malloc knows it is out of memory, it must make a system call — brk() or sbrk() — to grow the heap. Once in kernel mode, your process's page table can be extended, then control returns and malloc() has the required extra memory (or something like that).

But if you have Unix on top of a bare metal hypervisor, how does this actually happen? The hypervisor, it would seem, must have the actual page tables for the whole system (across OSs, even). So Unix can't be in kernel mode when a system call to Unix gets made, otherwise it could mess with other OSs running at the same time. On the other hand, if it is running in User mode, how would the code that implements break ever let the hypervisor know it wants more memory without the Unix code being rewritten?


Solution

  • In most architectures another level is added beyond supervisor, and supervisor is somewhat degraded. The kernel believes itself to be in control of the machine, but that is an illusion crafted by the hypervisor.

    In ARM, user code runs at exception level 0, the kernel at level 1, and the hypervisor at level 2. Intel was a bit short sighted (gasp) and had user as ring 3 and supervisor as ring 0, thus the hypervisor is a sort of ring -1. Obviously it's not literally -1, but that is a handy shorthand for the intensely ugly interface they constructed to handle this.

    In most architectures, the hypervisor gets to install an extra set of page tables which take effect after the guest's page tables do. So, your unix kernel thinks it was loaded at 1M physical, but it could be at any arbitrary address, and every page your unix kernel thinks is physically contiguous could be scattered across a vast set of actual (bus) addresses.

    Even if your architecture doesn't permit an extra level of page tables, it is straightforward enough for a hypervisor to "trap & emulate" the page tables constructed by the guest, and maintain the actual set in a completely transparent fashion. The continual move towards longer pipelines, however, increases the cost of each trap, so a hardware-supported extra level of page tables is much appreciated.

    So, your UNIX thinks it has all 8M of memory to itself; however, unbeknownst to it, a sneaky hypervisor may be paging that 8M to a really big floppy drive, and only giving it a paltry 640K of real RAM. All the normal unix-y stuff works fine, except that it may have a pretty trippy sense of time, where time slows and speeds up in alternating phases, as the hypervisor attempts to pretend that a 250msec floppy disk access completed in the time of a 60nsec dram access.

    This is where hypervisors get hard.