I am trying to figure out the performance impact of Intel Hyper-Threading (HT) on Linux x86-64.
Is there a well-known tool or ready-to-use code for this kind of testing?
If not, my test plan is as follows:
Scenario 1:
Thread 1: high priority, pinned to CoreN Thread0, sleeps 1 second.
Thread 2: mid priority, pinned to CoreN Thread0, increments an integer counter.
Threads 3 and 4 are the same as threads 1 and 2, but pinned to CoreN Thread1.
After 1 second, threads 1 and 3 print the counters incremented by threads 2 and 4 respectively.
Scenario 2:
Move threads 3 and 4 to a different physical core, run for 1 second, and check the counters again.
The expectation is that the integer-increment throughput in scenario 2 is better than in scenario 1.
Is this test plan reasonable for checking the performance impact of Intel HT?
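For reference, here's a minimal sketch of scenario 1, assuming pthreads; I've dropped the high-priority sleeper threads and just let main() time the run. The logical-CPU numbers (0 and 4 as HT siblings of one physical core) are assumptions for illustration; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your machine.

```c
// Sketch of scenario 1: two counter threads pinned to HT siblings.
// For scenario 2, change one .cpu to a logical CPU on another core.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

static volatile int stop;                 // volatile is enough for a rough test;
static volatile uint64_t counters[2];     // real code should use atomics

struct arg { int cpu; int idx; };

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *count_loop(void *p) {
    struct arg *a = p;
    pin_to_cpu(a->cpu);
    uint64_t n = 0;
    while (!stop)
        n++;                              // the "integer add" workload
    counters[a->idx] = n;
    return NULL;
}

int main(void) {
    struct arg a0 = { .cpu = 0, .idx = 0 };  // assumption: CPU 0
    struct arg a1 = { .cpu = 4, .idx = 1 };  // assumption: CPU 4 is CPU 0's HT sibling
    pthread_t t0, t1;
    pthread_create(&t0, NULL, count_loop, &a0);
    pthread_create(&t1, NULL, count_loop, &a1);
    sleep(1);                             // stands in for the sleeper threads
    stop = 1;
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter0=%llu counter1=%llu\n",
           (unsigned long long)counters[0], (unsigned long long)counters[1]);
    return 0;
}
```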
Your way of testing might make sense if your workload inherently has a fixed number of threads that's more than the number of physical cores, so you need to compare two threads competing for the same logical core (context-switching) against two threads sharing the two logical cores of one physical core.
That's not the normal case: most multi-threaded workloads can divide themselves across a variable number of threads, so you can choose a number of threads that matches your core count.
Usually you'd run something like x265 with N threads, where N is the number of physical cores you have (like ffmpeg -preset slow -c:v libx265 -x265-params pools=4 for one NUMA pool of 4 cores). Ideally do this with HT disabled at boot, or with one logical core of each HT pair taken offline, so Linux never schedules two threads onto the same physical core.
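Offlining a logical core is just a sysfs write (needs root): the shell one-liner is echo 0 > /sys/devices/system/cpu/cpu5/online. A minimal C sketch of the same thing, where cpu5 is an assumption; find the real HT sibling of each core in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list:

```c
// Sketch: take one logical CPU offline via sysfs (run as root).
// "cpu5" is an assumption for illustration.
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/sys/devices/system/cpu/cpu5/online", "w");
    if (!f) { perror("fopen"); return 1; }
    fputs("0", f);    // write "1" to bring it back online
    fclose(f);
    return 0;
}
```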
Then use 2N threads, keeping all the logical cores busy, to see if scaling to more threads helps or hurts throughput for your workload (hiding stalls vs. creating more stalls by competing for cache footprint / memory bandwidth).
In my testing, without bothering to offline cores, just comparing pools=4 vs. pools=8 on an i7-6700k Skylake with dual-channel DDR4-2666, 1080p x265 encoding at -preset slower speeds up by about 20% with 8 threads.
But 8 threads uses significantly more memory bandwidth (according to intel_gpu_top -l, which shows integrated memory-controller read/write bandwidth), and makes interactive use significantly more sluggish. (Either from the extra competition for L3 cache, or from not having free logical cores to schedule tasks onto, or both.)
Or if you want a microbenchmark that runs two simple loops against each other for a long time (instead of the instruction mix of a real program like x265, a BLAS SGEMM, a make -j8 compile, or something), then yeah, you'd write simple loops and run them under perf stat to see if reality matches what you might predict from the code having a front-end vs. back-end (especially specific execution ports) vs. latency bottleneck.
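For example, here's a minimal sketch of a latency-bound loop; the CPU numbers in the taskset commands are assumptions and need to match your machine's HT-sibling topology:

```c
// Sketch: a trivial latency-bound loop to run on each HT sibling, e.g.:
//   gcc -O2 loop.c -o loop
//   taskset -c 0 perf stat ./loop & taskset -c 4 perf stat ./loop
// (CPUs 0 and 4 are assumptions; they must be HT siblings for the
// "competing hyperthreads" case.)
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x = 1;
    // Serial dependency chain: latency-bound, leaving execution ports
    // mostly idle, so a sibling hyperthread running the same loop
    // should barely slow it down.
    for (uint64_t i = 0; i < 2000000000ULL; i++)
        x = x * 3 + 1;    // imul+add chain, roughly 4 cycles/iter on Skylake
    printf("%llu\n", (unsigned long long)x);  // keep x live past the optimizer
    return 0;
}
```

Swapping the loop body for independent throughput-bound work (or a front-end-heavy instruction mix) lets you compare how each bottleneck type behaves when two hyperthreads share a core.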
See https://stackoverflow.com/tags/x86/info and especially https://agner.org/optimize/ - Agner's microarch guide has fairly detailed info on how different parts of the CPU core are shared between hyper-threads. (e.g. ROB and store buffer are statically partitioned, cache and execution units are competitively shared, front-end alternates unless one thread is stalled.)