Tags: linux, x86, timestamp, rdtsc, userspace

How stable is TSC (TimeStamp Counter) from user space for Intel x86-64 CPUs in 2020?


Sometimes I need a reliable way to measure performance at nanosecond resolution from my user-space application, in order to include syscall delays in my measurements. I've read many old (10-year-old) articles saying it isn't stable and that it's going to be removed from user space.

  • In 2020, for Intel 8th/9th generation x86-64 CPUs, how stable is it? Can we still safely use TSC assembly code?
  • What are the best practices for using the TSC in user space nowadays?



Solution

  • It's as stable as the clock crystal on your motherboard, but it's locked to a reference frequency (which depends on the CPU model), not the current CPU core clock frequency. That change happened about 15 years ago (the constant_tsc CPU feature), making the TSC usable for wall-clock timing instead of core cycle counting.
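
    A quick way to confirm this on a given machine (my addition, a minimal sketch): Linux exports the relevant CPUID-derived flags in /proc/cpuinfo, so you can check for constant_tsc (fixed frequency) and nonstop_tsc (keeps ticking in deep sleep states) before trusting the TSC as a timesource.

    ```cpp
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream cpuinfo("/proc/cpuinfo");
        std::string line;
        while (std::getline(cpuinfo, line)) {
            if (line.rfind("flags", 0) == 0) {   // the flags line is the same for every core
                std::cout << "constant_tsc: " << (line.find("constant_tsc") != std::string::npos) << '\n'
                          << "nonstop_tsc:  " << (line.find("nonstop_tsc") != std::string::npos) << '\n';
                break;
            }
        }
    }
    ```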

    For example, the Linux VDSO user-space implementation of clock_gettime uses rdtsc and a scale factor to calculate an offset from the less-frequently-updated timestamp maintained by the kernel's timer interrupt. (VDSO = pages of code and data owned by the kernel, mapped read-only into user-space processes.)
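
    To make that concrete, here's a minimal sketch (my addition, not part of the original answer) of nanosecond timing via that path: clock_gettime(CLOCK_MONOTONIC) normally executes entirely in user space on Linux/x86-64, so it's cheap enough to time a single syscall.

    ```cpp
    #include <cstdio>
    #include <ctime>
    #include <unistd.h>

    int main() {
        timespec start{}, stop{};
        clock_gettime(CLOCK_MONOTONIC, &start);   // VDSO: rdtsc + scale factor, no kernel entry
        getpid();                                 // code under test: a cheap real syscall
        clock_gettime(CLOCK_MONOTONIC, &stop);

        long ns = (stop.tv_sec - start.tv_sec) * 1000000000L
                + (stop.tv_nsec - start.tv_nsec);
        printf("getpid() took ~%ld ns\n", ns);
    }
    ```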

    What are the best practices for using the TSC in user space nowadays?

    If you want to count core clock cycles, use rdpmc (with a HW perf counter programmed appropriately, and set up so user-space is allowed to read it), or use perf or some other way of using HW perf counters.
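
    For example, here's a sketch (my addition) of the plain read() path with a perf_event_open cycles counter - the slow-but-simple "other way"; the rdpmc fast path additionally needs an mmap of the fd and is omitted here. perf_event_open has no glibc wrapper, so it's invoked via syscall():

    ```cpp
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    static int perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                               int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main() {
        perf_event_attr attr{};
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   // core clock cycles, unlike the TSC
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0 /*this thread*/, -1 /*any cpu*/, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        volatile uint64_t sink = 0;
        for (int i = 0; i < 1000000; ++i) sink += i;   // code under test
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles = 0;
        read(fd, &cycles, sizeof(cycles));
        printf("core cycles: %llu\n", (unsigned long long)cycles);
        close(fd);
    }
    ```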

    But other than that, you can use rdtsc directly or indirectly via wrapper libraries.
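
    Directly, that's just the __rdtsc() intrinsic from x86intrin.h with GCC/clang; no inline asm needed (a sketch, my addition):

    ```cpp
    #include <x86intrin.h>   // __rdtsc (GCC/clang)
    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t t0 = __rdtsc();
        volatile uint64_t sink = 0;
        for (int i = 0; i < 1000; ++i) sink += i;   // code under test
        uint64_t t1 = __rdtsc();
        // These are reference-clock ticks, not core cycles or seconds;
        // divide by the TSC frequency to convert to time.
        printf("delta: %llu TSC ticks\n", (unsigned long long)(t1 - t0));
    }
    ```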

    Depending on your overhead requirements, and how much effort you're willing to put into finding out the TSC frequency so you can relate TSC counts to seconds, you might just use it via std::chrono or libc clock_gettime, which don't need to actually enter the kernel thanks to the VDSO.
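
    A sketch of that indirect route (my addition): std::chrono::steady_clock sits on top of clock_gettime(CLOCK_MONOTONic) on Linux, so you get TSC-backed nanosecond timestamps already scaled to real time, with no calibration work on your side.

    ```cpp
    #include <chrono>
    #include <cstdio>

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        volatile unsigned long long sink = 0;
        for (int i = 0; i < 100000; ++i) sink += i;   // code under test
        auto t1 = std::chrono::steady_clock::now();

        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        printf("elapsed: %lld ns\n", (long long)ns);
    }
    ```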

    How to get the CPU cycle count in x86_64 from C++? - my answer there has more details about the TSC, including how it worked on older CPUs, and the fact that out-of-order execution means you need lfence before/after rdtsc if you want to wait for earlier code to finish executing before it reads the internal TSC.
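
    A fenced version might look like this (my sketch of that technique, following the linked answer): lfence on Intel CPUs (and on AMD with Spectre mitigations enabled) doesn't let later instructions start until earlier ones have completed, which stops rdtsc from reordering around the region being timed.

    ```cpp
    #include <x86intrin.h>   // __rdtsc, _mm_lfence
    #include <cstdint>
    #include <cstdio>

    static inline uint64_t rdtsc_fenced() {
        _mm_lfence();              // wait for earlier instructions to complete
        uint64_t tsc = __rdtsc();
        _mm_lfence();              // keep later instructions from starting early
        return tsc;
    }

    int main() {
        uint64_t t0 = rdtsc_fenced();
        volatile uint64_t sink = 0;
        for (int i = 0; i < 1000; ++i) sink += i;   // code under test
        uint64_t t1 = rdtsc_fenced();
        printf("~%llu TSC ticks\n", (unsigned long long)(t1 - t0));
    }
    ```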

    Measuring chunks of code shorter than a few hundred instructions introduces the complication that throughput and latency are different things; it's not meaningful to characterize performance with a single number. Out-of-order exec means the surrounding code matters.

    and that it's going to be removed from user space.

    x86 has basically never removed anything, and definitely not from user-space. Backwards compat with existing binaries is x86's main claim to fame and reason for continued existence.

    rdtsc is documented in Intel's and AMD's x86 manuals, e.g. Intel's vol.2 entry for it. There is a CPU feature that lets the kernel disable RDTSC for user-space (TSD = TimeStamp Disable), but it's not normally used on Linux. (Note the #GP(0) exception condition: "If the TSD flag in register CR4 is set and the CPL is greater than 0" - Current Privilege Level 0 = kernel, higher = user-space.)

    IDK if there are any plans to use TSD by default; I'd assume not because it's a useful and efficient timesource. Even if so, on a dev machine where you want to do profiling / microbenchmarking you'd be able to toggle that feature. (Although usually I just put stuff in a large-enough repeat loop in a static executable and run it under perf stat to get total time and HW perf counters.)
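
    That workflow can be as simple as this sketch (my addition): a repeat loop whose iteration count dominates startup noise, run under perf stat ./a.out for total time and HW counters.

    ```cpp
    #include <cstdint>

    int main() {
        volatile uint64_t sink = 0;         // volatile keeps the loop from optimizing away
        for (uint64_t i = 0; i < 1000000000; ++i)
            sink += i;                      // stand-in for the code under test
    }
    ```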