Is it possible for a process inside a VM guest to use the virtualization (AMD-V, VT-x) CPU instructions, which are then processed by the outer VMM instead of directly on the CPU?
Edit: Assume that the outer VM uses VMX itself to manage its virtual guest machine (i.e. it runs in Ring -1).
If it is possible, are there any implementations of VMMs that support emulating/intercepting VMX calls (VMware, Parallels, KVM,...)?
Neither Intel's VT-x nor AMD's AMD-V supports fully recursive virtualization in hardware, where the CPU would keep a hierarchy of nested virtualized environments in the same fashion as a `call`/`ret` pair.
A logical processor only supports two modes of operation: the host mode (called VMX root mode in Intel's terminology, hypervisor in AMD's) and the guest mode (called as such in AMD's manuals and VMX non-root mode in Intel's).
This implies a flattened hierarchy where every virtualized environment is treated the same by the CPU - the CPU is unaware of how many levels deep the hierarchy of VMs is.
An attempt to use the virtualization instructions themselves inside a guest will yield control to the monitor (VMM).
However, some hardware support for accelerating frequently used virtualization instructions has appeared recently, making nested VMs practical.
I'll try to analyse the issues one faces when implementing nested virtualization.
I'm not dealing with the whole thing - I'm considering the base case only, leaving out everything that deals with the virtualization of the hardware, a part that is itself as problematic as the virtualization of the software.
Note
I'm not an expert on virtualization technology and have no experience with it at all - corrections are welcome.
The purpose of this answer is to make the reader conceptually believe that nested virtualization is possible and outline the problems to face.
A logical processor enters VMX operation by executing `vmxon` - as soon as the mode is entered the processor is in root mode.
Root mode is the mode of the VMM; it can launch, resume and handle the VMs.
The VMM then sets the current VMCS (VM Control Structure) with `vmptrld` - the VMCS contains all the metadata necessary to virtualise a guest.
The VMCS is read and written not with direct memory accesses‡ but with the `vmread` and `vmwrite` instructions.
Finally, the VMM executes `vmlaunch` to start executing the guest.
Now the logical processor is executing in a virtualized environment.
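The sequence above can be pictured with a toy state machine. This is only a conceptual sketch in Python, not real VMX code: the class names, the string-valued modes and the dictionary-based VMCS are all assumptions made for illustration, and the real instructions are privileged and take physical addresses.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class VmxError(Exception):
    """An instruction was used in a state where this toy model rejects it."""


@dataclass
class Vmcs:
    # Hypothetical field store; the real VMCS layout is implementation-defined.
    fields: Dict[str, int] = field(default_factory=dict)
    launched: bool = False


@dataclass
class LogicalProcessor:
    mode: str = "no-vmx"                  # "no-vmx", "root" or "non-root"
    current_vmcs: Optional[Vmcs] = None

    def vmxon(self) -> None:
        if self.mode != "no-vmx":
            raise VmxError("already in VMX operation")
        self.mode = "root"                # VMX operation starts in root mode

    def vmptrld(self, vmcs: Vmcs) -> None:
        self._require_root()
        self.current_vmcs = vmcs          # this VMCS becomes the current one

    def vmwrite(self, name: str, value: int) -> None:
        self._current().fields[name] = value

    def vmread(self, name: str) -> int:
        return self._current().fields[name]

    def vmlaunch(self) -> None:
        vmcs = self._current()
        if vmcs.launched:
            raise VmxError("VMCS already launched; vmresume would be needed")
        vmcs.launched = True
        self.mode = "non-root"            # the guest is now running

    def _require_root(self) -> None:
        if self.mode != "root":
            raise VmxError("instruction only modelled for VMX root mode")

    def _current(self) -> Vmcs:
        self._require_root()
        if self.current_vmcs is None:
            raise VmxError("no current VMCS")
        return self.current_vmcs


cpu = LogicalProcessor()
cpu.vmxon()
cpu.vmptrld(Vmcs())
cpu.vmwrite("GUEST_RIP", 0x1000)          # field name used as a plain label here
cpu.vmlaunch()
print(cpu.mode)                           # prints "non-root"
```

In this model a second `vmxon` simply raises an error; on real hardware, executing it in non-root mode is what triggers the VM exit discussed next.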
Suppose the guest is a VMM itself and let's call this the non-root VMM - it needs to repeat the steps above.
But Intel is clear in its manuals (Manual 3 - Chapter 25.1.2):
The following instructions cause VM exits when they are executed in VMX non-root operation:
[...]
This is also true of instructions introduced with VMX, which include: [...], VMLAUNCH, VMPTRLD, [...] and VMXON
`vmxon`: this instruction causes a VM exit. The root VMM resumes at the host entry point recorded in its VMCS (conceptually, right after its last `vmlaunch`), can inspect the VMCS for the reason of the exit and take appropriate action.
I'm not a seasoned VMM writer, so I'm not sure what exactly the root VMM has to do to emulate this instruction. Executing a `vmxon` in VMX root mode will fail, and doing a `vmxoff` followed by a `vmxon` with the VMXON region given by the non-root VMM seems a security vulnerability (or a lead to one). I therefore believe that all the root VMM has to do is record that the guest is now in "VMX root mode".
The quotes are necessary here: this mode exists only in software. When the root VMM hands control back to the non-root VMM, the CPU will be in VMX non-root mode.
After that, the non-root VMM will attempt to use `vmptrld` to set the current VMCS.
`vmptrld` will induce a VM exit and the root VMM is in control once again. If the CPU doesn't support VMCS shadowing, the root VMM has to record that the pointer given by the non-root VMM is now the current VMCS. If the CPU does support VMCS shadowing, the root VMM sets the VMCS link pointer field of its VMCS (the one used to virtualise the non-root VMM) to the VMCS given by the non-root VMM.
One way or another the VMM knows which virtualised VMCS is active.
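The bookkeeping for these two exits can be sketched as follows. This is a hypothetical model, not the code of any real VMM: `RootVmm`, `VirtualVmxState` and `handle_exit` are invented names, and the guest's address is used as-is, whereas a real VMM would presumably have to validate and translate it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualVmxState:
    """What the root VMM remembers about the non-root VMM (software only)."""
    in_virtual_root_mode: bool = False
    current_vmcs_addr: Optional[int] = None


class RootVmm:
    def __init__(self, vmcs_shadowing: bool) -> None:
        self.vmcs_shadowing = vmcs_shadowing
        self.guest = VirtualVmxState()
        # Models the VMCS link pointer field of the VMCS that runs the
        # non-root VMM; None means "not set".
        self.vmcs_link_pointer: Optional[int] = None

    def handle_exit(self, instruction: str, operand: Optional[int] = None) -> None:
        if instruction == "vmxon":
            # The CPU stays in non-root mode; only this flag changes.
            self.guest.in_virtual_root_mode = True
        elif instruction == "vmptrld":
            if not self.guest.in_virtual_root_mode:
                raise RuntimeError("guest used vmptrld before vmxon")
            self.guest.current_vmcs_addr = operand
            if self.vmcs_shadowing:
                # Simplification: the guest-supplied address is stored directly.
                self.vmcs_link_pointer = operand
        else:
            raise NotImplementedError(instruction)


vmm = RootVmm(vmcs_shadowing=True)
vmm.handle_exit("vmxon")
vmm.handle_exit("vmptrld", 0x5000)
print(vmm.guest.in_virtual_root_mode, hex(vmm.vmcs_link_pointer))
```

With `vmcs_shadowing=False` the link pointer stays unset and only `current_vmcs_addr` records which virtualised VMCS is active.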
`vmread` and `vmwrite` executed by the non-root VMM may or may not cause a VM exit.
If VMCS shadowing is active, the CPU won't do a VM exit and will instead access the VMCS pointed to by the VMCS link pointer in the active VMCS (called the shadow VMCS).
This will speed up virtualization of nested VMs.
If VMCS shadowing is not active the CPU will VM exit and the root VMM has to emulate the read/write.
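To make the cost difference concrete, here is a small sketch that only counts VM exits. It is a toy model under stated assumptions: `NestedVmcsAccess` and `count_exits` are hypothetical names, the VMCS is a plain dictionary, and the field names are used as labels only.

```python
from typing import Dict


class NestedVmcsAccess:
    """Toy model of vmread/vmwrite issued by the non-root VMM."""

    def __init__(self, vmcs_shadowing: bool) -> None:
        self.vmcs_shadowing = vmcs_shadowing
        self.guest_vmcs: Dict[str, int] = {}   # the VMCS the non-root VMM works on
        self.vm_exits = 0

    def guest_vmwrite(self, name: str, value: int) -> None:
        if not self.vmcs_shadowing:
            self.vm_exits += 1                  # root VMM must emulate the write
        self.guest_vmcs[name] = value

    def guest_vmread(self, name: str) -> int:
        if not self.vmcs_shadowing:
            self.vm_exits += 1                  # root VMM must emulate the read
        return self.guest_vmcs[name]


def count_exits(vmcs_shadowing: bool) -> int:
    access = NestedVmcsAccess(vmcs_shadowing)
    access.guest_vmwrite("GUEST_RIP", 0x2000)
    access.guest_vmwrite("GUEST_RSP", 0x8000)
    access.guest_vmread("GUEST_RIP")
    return access.vm_exits


print(count_exits(vmcs_shadowing=True), count_exits(vmcs_shadowing=False))  # prints "0 3"
```

Since a VMM typically touches many VMCS fields around every entry and exit, saving one VM exit per access is where the speed-up comes from.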
Finally, the non-root VMM will launch its VM - this is a nested VM.
`vmlaunch` will trigger a VM exit.
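The root VMM has to do a few things, among them saving its current VMCS, merging it with the VMCS of the non-root VMM and finally executing `vmlaunch`/`vmresume`.
Now the CPU is executing the nested VM (a VVM - Virtual VM?).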
What happens when a sensitive instruction or an event causes a VM Exit?
From the processor's point of view, there are only two levels of virtualization: the root VMX mode and the non-root VMX mode.
Since the guest is in non-root VMX mode, control is transferred back to the root VMX mode code - i.e. the root VMM.
The root VMM must now determine whether that event comes from its VM or from its VM's VM.
This can be done by tracking the use of `vmlaunch`/`vmresume` and checking the bits in the VMCS.
If the VM exit is directed to the non-root VMM, the root VMM has to load its original VMCS, possibly set in it the link pointer to the non-root VMM's VMCS, update the status bits of the non-root VMM's VMCS and do a `vmresume`.
If the VM Exit is directed to it, the root VMM will handle it as any other VM Exit.
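The routing decision described above can be sketched like this. It is a minimal model with invented names (`ExitRouter`, `on_emulated_vmlaunch`, `route`); exit reasons are plain strings, and the set of intercepts stands in for the control bits the non-root VMM wrote into its own VMCS.

```python
from typing import Set


class ExitRouter:
    """Decides who handles a VM exit: the root VMM or the non-root VMM."""

    def __init__(self) -> None:
        self.nested_vm_running = False
        self.non_root_intercepts: Set[str] = set()

    def on_emulated_vmlaunch(self, intercepts: Set[str]) -> None:
        # Called when the root VMM emulates vmlaunch/vmresume for the
        # non-root VMM; from now on the nested VM is the one executing.
        self.nested_vm_running = True
        self.non_root_intercepts = set(intercepts)

    def route(self, exit_reason: str) -> str:
        if self.nested_vm_running and exit_reason in self.non_root_intercepts:
            # The non-root VMM asked for this event: it runs next, so the
            # nested VM is no longer the one executing.
            self.nested_vm_running = False
            return "non-root VMM"
        # Either no nested VM is running or the non-root VMM did not ask
        # for this event: the root VMM handles it like any other exit.
        return "root VMM"


router = ExitRouter()
print(router.route("cpuid"))                  # no nested VM yet
router.on_emulated_vmlaunch({"cpuid"})
print(router.route("external-interrupt"))     # not requested by the non-root VMM
print(router.route("cpuid"))                  # requested by the non-root VMM
```

The three calls print `root VMM`, `root VMM` and `non-root VMM` in that order. A real VMM would additionally have to copy the exit information into the non-root VMM's VMCS before resuming it, which this sketch leaves out.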
What if we want to create a VM inside the nested VM? Kind of a Virtual Virtual VM (VVVM).
There are two things to notice:
There is no root vs non-root mode in AMD-V; the CPU starts executing a VM with `vmrun`, which takes a pointer to a VMCB (VM Control Block) that serves the same purpose as Intel's VMCS.
Upon a `vmrun` the CPU is in guest mode.
The VMCB is cached, but it is accessed only with ordinary memory accesses (there is no `vmread`/`vmwrite` equivalent).
The `vmload`/`vmsave` instructions explicitly load into and save from the cache the VMCB fields subject to caching.
This interface is simpler than Intel's but just as powerful - even when it comes to nesting virtualization.
Assume we are inside a VM and the code executes a `vmrun` - thus we are virtualizing a VMM.
Technically a VMM can choose whether `vmrun` will or will not trigger a VM exit.
In practice, however, AMD-V currently requires the former to always be the case:
The following conditions are considered illegal state combinations: [...]
* The VMRUN intercept bit is clear
Thus the root VMM (I'll use the same terminology as in the Intel case) will gain control and has to emulate a `vmrun` (since the hardware only supports a single level of virtualisation).
The root VMM can save and merge the current VMCB with the non-root VMM's VMCB and go ahead with the `vmrun`, as in the Intel case.
Upon an exit the root VMM has to determine whether the exit was directed to it or to the non-root VMM; again, this can be done by tracking the `vmrun` and the control bits in the VMCB.
We have set up a VM inside a VM relatively easily - now what happens upon a VM exit?
The root VMM receives the exit and, if it is directed to the non-root VMM, it has to restore its original VMCB and resume the run (i.e. use `vmrun` with its original VMCB).
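The AMD flow (intercepted `vmrun`, merged VMCB, restore on exit) can be sketched as below. This is an illustration under loud assumptions: `Vmcb`, `merge` and `SvmRootVmm` are hypothetical names, a VMCB is reduced to a set of intercepts plus an instruction pointer, and "merging" is taken to mean the union of both intercept sets, which is my reading rather than something the manuals spell out in these terms.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Vmcb:
    intercepts: FrozenSet[str]
    rip: int


def check_vmcb(vmcb: Vmcb) -> None:
    # Mirrors the rule quoted above: a clear VMRUN intercept bit is illegal.
    if "vmrun" not in vmcb.intercepts:
        raise ValueError("illegal state: VMRUN intercept bit is clear")


def merge(root_vmcb: Vmcb, nested_vmcb: Vmcb) -> Vmcb:
    # Assumption: the root VMM must see every event that either level wants.
    return Vmcb(root_vmcb.intercepts | nested_vmcb.intercepts, nested_vmcb.rip)


class SvmRootVmm:
    def __init__(self, vmcb_for_non_root_vmm: Vmcb) -> None:
        check_vmcb(vmcb_for_non_root_vmm)
        self.original = vmcb_for_non_root_vmm   # saved copy
        self.active = vmcb_for_non_root_vmm     # what the next real vmrun uses
        self.nested: Optional[Vmcb] = None

    def emulate_vmrun(self, nested_vmcb: Vmcb) -> None:
        # Reached because the non-root VMM's vmrun was intercepted.
        check_vmcb(nested_vmcb)
        self.nested = nested_vmcb
        self.active = merge(self.original, nested_vmcb)

    def on_exit(self, reason: str) -> str:
        if self.nested is not None and reason in self.nested.intercepts:
            # Directed to the non-root VMM: restore the original VMCB.
            self.nested = None
            self.active = self.original
            return "non-root VMM"
        return "root VMM"


root = SvmRootVmm(Vmcb(frozenset({"vmrun", "hlt"}), rip=0x1000))
root.emulate_vmrun(Vmcb(frozenset({"vmrun", "cpuid"}), rip=0x4000))
print(sorted(root.active.intercepts), root.on_exit("cpuid"), hex(root.active.rip))
```

After the `cpuid` exit is routed to the non-root VMM, `active` is the original VMCB again, which matches the "resume the run with its original VMCB" step.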
AMD-V supports fast virtualisation of the `vmsave` and `vmload` instructions by treating their addresses as guest addresses, which are thus subject to the usual nested-paging translation.
As with the Intel case, the virtualization can be nested again as long as the VMM supports that feature.
The critical security warning noted for Intel's case applies to AMD's as well.
‡ Due to its implementation-defined format and the fact the memory area can be used just as a spill area that is not updated in real time