Why is RDTSC a virtualized instruction on modern processors?

I am studying RDTSC and learning how it is virtualized for virtual machines like VirtualBox and VMware. Why did Intel/AMD go to all the trouble of virtualizing this instruction?

I feel like it could easily be simulated with a trap, and it's not exactly a super-common instruction (I tested, and there's no noticeable slowdown for general usage in a virtual machine where hardware RDTSC virtualization is disabled).

However, I know Intel/AMD wouldn't have gone to all the trouble of adding this instruction to the virtualization hardware unless it was important for it to execute very fast.

Does anyone know why?

Solution

It's common to use RDTSC to get fine-grained timing information, where the overhead of a virtualization trap would be quite significant. The most common use is to place two RDTSC instructions around a small amount of code and take the difference of the two readings as the elapsed time (number of cycles) for that code sequence. At that scale, even the overhead of pipeline drains/flushes is quite significant, so a trap into the hypervisor on each read would swamp the measurement.
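As a rough sketch of that two-reading pattern, assuming GCC or Clang on x86 (where `__rdtsc()` comes from `<x86intrin.h>`), something like the following could be used. The `work` function is a hypothetical stand-in for the code being measured, and `cycles_for_work` is just an illustrative name. Note that RDTSC is not a serializing instruction, so careful benchmarks usually add fences or use RDTSCP; that is left out here for brevity.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang for x86; MSVC uses <intrin.h> */

/* Hypothetical workload: sum the integers 0..n-1.
 * `volatile` keeps the compiler from optimizing the loop away. */
static uint64_t work(uint64_t n)
{
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Read the TSC before and after one call to work(n) and return the
 * difference. If every RDTSC trapped to the hypervisor, the cost of
 * the two traps would be included in this number. */
static uint64_t cycles_for_work(uint64_t n)
{
    uint64_t start = __rdtsc();
    (void)work(n);
    uint64_t end = __rdtsc();
    return end - start;
}
```

Calling `cycles_for_work(1000)` a few times and taking the minimum is a common way to reduce noise from interrupts and cache effects.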

Also, since all the instruction does is read a continuously running counter, virtualizing it is quite easy -- the hardware only needs to give each VM its own view of the counter, by saving/reloading a per-VM adjustment (in practice, a TSC offset that is added to the host counter) on VM context switches, and does not need anything special for the RDTSC instruction itself.
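To illustrate why this is cheap, here is a minimal software model of a per-VM counter adjustment. It assumes the offset-style scheme that Intel VT-x and AMD-V expose as a TSC offset field. The struct and function names (`vm_tsc`, `guest_tsc`, `vm_tsc_set`) are hypothetical and only model the arithmetic; they are not a real hypervisor API.

```c
#include <stdint.h>

/* Hypothetical per-VM state: the value added to the host TSC.
 * Arithmetic is modulo 2^64, so a "negative" adjustment is
 * represented by unsigned wraparound. */
struct vm_tsc {
    uint64_t offset;
};

/* Value a guest RDTSC would return under this model: the host
 * counter plus the VM's offset, with no trap involved. */
static uint64_t guest_tsc(const struct vm_tsc *vm, uint64_t host_tsc)
{
    return host_tsc + vm->offset;
}

/* Pick the offset so that the guest reads `wanted` at the moment
 * the host counter reads `host_tsc` (e.g. 0 when the VM boots). */
static void vm_tsc_set(struct vm_tsc *vm, uint64_t host_tsc, uint64_t wanted)
{
    vm->offset = wanted - host_tsc;
}
```

In this model the only per-VM work is swapping one 64-bit value when switching VMs; the read itself stays a plain counter read plus an add.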