performance concurrency x86 hyperthreading

What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?


Two different threads within a single process can share a common memory location by reading and/or writing to it.

Usually, such (intentional) sharing is implemented with atomic operations that use the lock prefix on x86. Those have fairly well-known costs, both for the lock prefix itself (i.e., the uncontended cost) and for the additional coherence traffic when the cache line is actually shared (true or false sharing).

Here I'm interested in producer-consumer costs where a single thread P writes to a memory location and another thread C reads from the same location, both using plain reads and writes.

What are the latency and throughput of such an operation when performed on separate cores on the same socket, compared with when it is performed on sibling hyperthreads on the same physical core, on recent x86 cores?

In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
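The setup described above can be sketched as a small test harness. This is only a minimal sketch, not the code from the original question: the names `shared_value`, `Result` and `run_producer_consumer` are hypothetical, the 64-byte cache line size is an assumption, and relaxed `std::atomic` accesses stand in for "plain" reads and writes (on x86 they typically compile to ordinary `mov` instructions without a lock prefix, while avoiding a data race in C++). Thread pinning is deliberately left out; on Linux one could restrict the process to two chosen logical CPUs externally, with the sibling pairs listed in `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list`.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// The shared location, alone on its own cache line (64 bytes is an assumption).
alignas(64) std::atomic<std::uint64_t> shared_value{0};

struct Result {
    std::uint64_t writes;        // number of stores issued by the producer
    std::uint64_t changes_seen;  // distinct values the consumer observed
    double seconds;              // wall time from first store to consumer exit
};

// Producer P stores 1..writes; consumer C spins on loads until it sees `writes`.
Result run_producer_consumer(std::uint64_t writes) {
    shared_value.store(0, std::memory_order_relaxed);
    std::atomic<bool> consumer_ready{false};
    std::uint64_t changes_seen = 0;

    std::thread consumer([&] {
        std::uint64_t last = 0;
        consumer_ready.store(true, std::memory_order_release);
        while (last != writes) {
            // Relaxed load: no lock prefix, no fence.
            std::uint64_t v = shared_value.load(std::memory_order_relaxed);
            if (v != last) {
                ++changes_seen;
                last = v;
            }
        }
    });

    // Start timing only once the consumer is actually running.
    while (!consumer_ready.load(std::memory_order_acquire)) {}
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 1; i <= writes; ++i) {
        // Relaxed store: no lock prefix, no fence.
        shared_value.store(i, std::memory_order_relaxed);
    }
    consumer.join();
    auto stop = std::chrono::steady_clock::now();

    // The consumer can skip values but never sees more changes than stores.
    assert(changes_seen <= writes);
    return {writes, changes_seen,
            std::chrono::duration<double>(stop - start).count()};
}
```

Running `run_producer_consumer` once with the two threads on hyper-siblings and once on separate cores, then comparing `seconds` and `changes_seen`, gives a rough view of throughput and of how often a new value actually reaches the consumer. It does not measure one-way latency directly.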


Solution

  • The killer problem is that the core makes speculative reads. Each time a write hits the speculatively read address (or, more correctly, the same cache line) before the read is "fulfilled", the CPU must undo the read (at least on x86), which effectively means it cancels all speculative instructions from that instruction onward.

    At some point before the read is retired it gets "fulfilled", i.e. no earlier instruction can fail and there is no longer any reason to reissue, so the CPU can act as if it had executed all earlier instructions.

    Other core example

    The two cores play cache ping-pong in addition to cancelling instructions, so this should be worse than the HT version.

    Let's start at a point in the process where the cache line holding the shared data has just been marked shared because the Consumer has asked to read it.

    1. The Producer now wants to write to the shared data and sends out a request for exclusive ownership of the cache line.
    2. The Consumer receives its cache line, still in the shared state, and happily reads the value.
    3. The Consumer continues to read the shared value until the exclusive request arrives.
    4. When the exclusive request arrives, the Consumer sends a shared request for the cache line.
    5. At this point the Consumer clears its instructions, starting from the first unfulfilled load of the shared value.
    6. While the Consumer waits for the data, it runs ahead speculatively.

    So the Consumer can advance in the period between getting its shared cache line and having it invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely 2, as the CPU has 2 read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied that they can't fail.
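    The cache-line ping-pong in the steps above can be illustrated with a toy state machine. This is a heavily simplified MESI-style model written for illustration only: the names `State`, `Line`, `producer_write` and `consumer_read` are hypothetical, and it ignores request latency, speculation and instruction clears entirely. It only shows that every Producer write invalidates the Consumer's copy and that the next Consumer read must miss and re-fetch the line.

```cpp
#include <cassert>

// Simplified MESI states for one cache line in one core's private cache.
enum class State { Invalid, Shared, Exclusive, Modified };

struct Line {
    State producer = State::Invalid;  // state in the Producer's cache
    State consumer = State::Invalid;  // state in the Consumer's cache
    int invalidations = 0;            // times the Consumer's copy was invalidated
    int consumer_misses = 0;          // Consumer reads that had to fetch the line
};

// Producer store: gain exclusive ownership, invalidating the Consumer's copy.
void producer_write(Line& line) {
    if (line.consumer != State::Invalid) {
        line.consumer = State::Invalid;
        ++line.invalidations;
    }
    line.producer = State::Modified;
}

// Consumer load: returns true on a hit, false when the line must be re-fetched.
bool consumer_read(Line& line) {
    // The Consumer never writes, so it can never hold the line Modified.
    assert(line.consumer != State::Modified);
    if (line.consumer != State::Invalid) {
        return true;  // still holds a valid copy: cheap local read
    }
    ++line.consumer_misses;
    if (line.producer != State::Invalid) {
        // The Producer supplies the data; both copies end up Shared.
        line.producer = State::Shared;
        line.consumer = State::Shared;
    } else {
        line.consumer = State::Exclusive;  // nobody else holds the line
    }
    return false;
}
```

    In this model, alternating `producer_write` and `consumer_read` calls makes every read a miss, while several reads between two writes hit locally after the first one. That is the window in which the Consumer "can advance" in the description above.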

    Same core HT

    Here the two hyperthreads share the core and must share its resources.

    The cache line should stay in the exclusive state all the time as they share the cache and therefore don't need the cache protocol.

    Now why does it take so many cycles on the HT core? Let's start with the Consumer just having read the shared value.

    1. In the next cycle, a write from the Producer occurs.
    2. The Consumer thread detects the write and cancels all its instructions from the first unfulfilled read.
    3. The Consumer re-issues its instructions taking ~5-14 cycles to run again.
    4. Finally the first instruction, which is a read, is issued and executed; it does not read a speculative value but a correct one, because it is at the front of the queue.

    So for every read of the shared value the Consumer is reset.

    Conclusion

    The different-core version apparently advances so much between each cache ping-pong that it performs better than the HT one.

    What would have happened if the CPU waited to see if the value had actually changed?

    For the test code, the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was covering the reissue latency.

    But if the data had been different the same problem would arise, except it would be worse for the different core version as it would then also have to wait for the cache line, and then reissue.

    So if the OP can change some of the roles, letting the time stamp producer read from the shared location and take the performance hit, it would be better.
