Tags: multithreading, cpu, cpu-architecture, hyperthreading

x86 Hyper-threading clarification on cache miss


If I understood correctly, hyper-threading on x86 CPUs is especially beneficial when we have IO calls, so that while the blocking thread is idle another thread can cheaply do work on the same CPU. My question is whether the same thing happens on a cache miss as well: while one thread waits the hundreds of cycles it takes to fetch data from main memory, can another thread execute code on the same physical CPU?


Solution

  • The answer is - yes, within some limits.

    Hyperthreading indeed allows you to do fine-grained interleaving of two program contexts (to which the OS may attach software threads). Instructions and cached data from both threads will coexist in the core simultaneously.
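
    On Linux, for example, you can attach one software thread to each of a core's two hardware contexts with CPU affinity. Here is a minimal sketch; the sibling logical-CPU numbers 0 and 4 are an assumption, so check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list for the actual pairing on your machine:

        // Pin two software threads to the two logical CPUs of one physical
        // core, so they share its pipelines and caches via hyperthreading.
        // Build with: g++ -O2 -pthread pin.cpp
        #include <pthread.h>
        #include <sched.h>
        #include <cstdio>
        #include <thread>

        static void pin_to_cpu(int cpu) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }

        static void worker(int cpu, const char* name) {
            pin_to_cpu(cpu);
            volatile unsigned long x = 0;   // busy work to keep the context occupied
            for (unsigned long i = 0; i < 100000000UL; ++i) x += i;
            std::printf("%s finished on logical CPU %d\n", name, cpu);
        }

        int main() {
            std::thread a(worker, 0, "A");  // first hardware context
            std::thread b(worker, 4, "B");  // assumed hyperthread sibling
            a.join();
            b.join();
        }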

    Now, in modern CPUs you don't simply have one big pipeline where you arbitrate once at the front on each cycle. Instead, each unit (the memory unit, caches, execution units, out-of-order components, etc.) has its own pipeline and communication channels with the other units. Some of these units are partitioned to support two threads and may choose on each cycle where to take the next task from (assuming they have separate incoming queues to choose from). Other units may be duplicated between the threads, or may arbitrate based on other metrics. The exact choice is, of course, implementation-specific, but whenever there is threaded arbitration, the hardware will attempt to balance the choices (with a round-robin policy, for example), as the toy model sketched below illustrates.
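
    Here is a toy software model of such a shared stage arbitrating between two per-thread queues. It is purely illustrative (no real core is implemented this way); it just shows the policy of preferring the thread whose turn it is, and falling back to the other when that thread has nothing ready:

        // Toy model: each "cycle", a shared stage picks the next micro-op
        // from one of two per-thread queues, round-robin, skipping a thread
        // that has nothing ready (e.g. because it is stalled).
        #include <deque>
        #include <cstdio>

        struct Uop { int thread; int id; };

        int main() {
            std::deque<Uop> queue[2] = {
                {{0, 1}, {0, 2}, {0, 3}},   // thread 0's pending micro-ops
                {{1, 1}, {1, 2}}            // thread 1's pending micro-ops
            };
            int turn = 0;                   // whose turn it is this cycle
            for (int cycle = 0; !queue[0].empty() || !queue[1].empty(); ++cycle) {
                int pick = !queue[turn].empty() ? turn : 1 - turn;
                Uop u = queue[pick].front();
                queue[pick].pop_front();
                std::printf("cycle %d: issued uop %d from thread %d\n",
                            cycle, u.id, u.thread);
                turn = 1 - turn;            // rotate priority
            }
        }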

    Saying that a thread is stalled is also not simple: a thread with a pending memory request may still fetch ahead and even execute non-dependent operations, and a thread recovering from a branch misprediction can already be fetching down the correct path. Things always get done somewhere. However, on any given unit you may discover that one thread is indeed stuck, in which case the arbitration will usually favor the other one. As a result, when a thread has trouble progressing on some part of the CPU, the other thread effectively gets a larger time share of that resource. Still, the blocked thread may occupy parts of that resource in a way that limits the free one, so it's wrong to say that when one thread is blocked, the other gets free rein over the core, or even over a single unit. It simply gets a better share.
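
    One way to see that a pending miss doesn't freeze a thread is to walk two independent pointer chains in the same loop: the out-of-order engine can issue the second chain's loads while the first chain's miss is still outstanding, so the combined walk typically takes much less than twice the time of one chain alone. A minimal sketch (array sizes and the measured gap are machine-dependent):

        // Compare walking one dependent pointer chain with walking two
        // independent chains in the same loop; the misses of the two
        // chains overlap, so the second loop is usually far cheaper than 2x.
        #include <algorithm>
        #include <chrono>
        #include <cstdio>
        #include <numeric>
        #include <random>
        #include <vector>

        // Build a random cyclic permutation: each load depends on the last.
        static std::vector<size_t> make_chain(size_t n, unsigned seed) {
            std::vector<size_t> idx(n);
            std::iota(idx.begin(), idx.end(), size_t{0});
            std::shuffle(idx.begin(), idx.end(), std::mt19937(seed));
            std::vector<size_t> next(n);
            for (size_t i = 0; i < n; ++i)
                next[idx[i]] = idx[(i + 1) % n];
            return next;
        }

        int main() {
            const size_t n = 1 << 23;       // ~8M entries, far beyond the caches
            auto a = make_chain(n, 1), b = make_chain(n, 2);
            size_t pa = 0, pb = 0;

            auto t0 = std::chrono::steady_clock::now();
            for (size_t i = 0; i < n; ++i) pa = a[pa];                  // one chain
            auto t1 = std::chrono::steady_clock::now();
            for (size_t i = 0; i < n; ++i) { pa = a[pa]; pb = b[pb]; }  // two chains
            auto t2 = std::chrono::steady_clock::now();

            auto ms = [](auto d) {
                return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
            };
            std::printf("one chain:  %lld ms\n", ms(t1 - t0));
            std::printf("two chains: %lld ms (%zu %zu)\n", ms(t2 - t1), pa, pb);
        }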

    For example, since you asked about memory accesses: when a thread misses the cache and has to go outside (to the next cache level, or to main memory), the wait for the required data will probably cause a data-dependency stall that prevents younger instructions from executing, and thereby possibly delays future memory accesses that depend on it (if you were traversing a linked list you're stuck, but if you were traversing an array you wouldn't even notice). This gives the other thread a bigger share of the memory unit (and of the most important resource there: the miss buffers that hold requests which need to be sent outside). In the long run, that thread would probably show slightly better performance as a result.
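
    As a rough illustration of that last point, compare chasing a randomly linked list (each load depends on the previous one, so the misses serialize) with summing a plain array (independent, prefetch-friendly accesses): over the same amount of memory, the array walk is typically many times faster. A sketch, with machine-dependent sizes and timings:

        // Dependent vs independent misses: a randomized linked-list walk
        // serializes its cache misses, while an array walk lets the core
        // and the prefetchers keep many accesses in flight at once.
        #include <algorithm>
        #include <chrono>
        #include <cstdio>
        #include <numeric>
        #include <random>
        #include <vector>

        struct Node { Node* next; long pad[7]; };   // ~64 bytes: one cache line per node

        int main() {
            const size_t n = 1 << 21;               // ~2M nodes, far beyond the caches
            std::vector<Node> nodes(n);
            std::vector<long> arr(n, 1);

            // Link the nodes in random order so each hop is a likely miss.
            std::vector<size_t> order(n);
            std::iota(order.begin(), order.end(), size_t{0});
            std::shuffle(order.begin(), order.end(), std::mt19937(42));
            for (size_t i = 0; i + 1 < n; ++i)
                nodes[order[i]].next = &nodes[order[i + 1]];
            nodes[order[n - 1]].next = nullptr;

            auto t0 = std::chrono::steady_clock::now();
            long hops = 0;
            for (Node* p = &nodes[order[0]]; p; p = p->next) ++hops;    // dependent
            auto t1 = std::chrono::steady_clock::now();
            long sum = 0;
            for (size_t i = 0; i < n; ++i) sum += arr[i];               // independent
            auto t2 = std::chrono::steady_clock::now();

            auto ms = [](auto d) {
                return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
            };
            std::printf("list walk:  %lld ms (%ld hops)\n", ms(t1 - t0), hops);
            std::printf("array walk: %lld ms (sum %ld)\n", ms(t2 - t1), sum);
        }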