I want to pin threads to all cores across two CPU sockets, and have the threads communicate without writing back to DRAM.
Write-back to cache would be fine for my throughput if I only used the cores in one socket, but for two sockets, I wonder if there is anything faster, like an on-chip network or the Intel QuickPath Interconnect?
What's more, is there any easy way to exploit such a feature without writing assembly code directly?
ref: https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/700477
TL:DR: no, CPU hardware is already optimized for one core storing, another core loading. There's no magic high-performance lower-latency method you can use instead. If the write side can force write-back to L3 somehow, that can reduce latency for the read-side, but unfortunately there's no good way to do that (except on Tremont Atom, see below).
(Update: UIPI in Sapphire Rapids is a way for user-space to send inter-processor interrupts for low-latency IPC without polling memory or making system calls. Fastest way for one core to signal another? / Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions?)
Shared last-level cache already backstops coherency traffic, avoiding write/re-read to DRAM.
Don't be fooled by MESI diagrams; those show single-level caches without a shared cache.
In real CPUs, stores from one core only have to write-back to last-level cache (LLC = L3 in modern x86) for loads from other cores to access them. L3 can hold dirty lines; all modern x86 CPUs have write-back L3 not write-through.
On a modern multi-socket system, each socket has its own memory controllers (NUMA), so snooping detects when cache->cache transfers need to happen over the interconnect between sockets. But yes, pinning threads to the same physical socket does improve inter-core / inter-thread latency. (Similarly for AMD Zen, where clusters of 4 cores share a chunk of LLC: within vs. across clusters matters for inter-core latency even within a single socket, because there isn't one big LLC shared across all cores.)
You can't do much better than this; a load on one core will generate a share request once it reaches L3 and finds the line is Modified in the private L1d or L2 of another core. This is why latency is higher than an L3 hit: the load request has to get to L3 before it even knows it's not just going to be an L3 hit. But Intel uses its large shared inclusive L3 cache tags as a snoop filter, to track which core on the chip might have a line cached. (This changed in Skylake-Xeon; its L3 is no longer inclusive, not even tag-inclusive, and must have some separate snoop filter.)
See also Which cache mapping technique is used in intel core i7 processor?
Fun fact: on Core 2 CPUs traffic between cores really was as slow as DRAM in some cases, even for cores that shared a last-level L2 cache.
Early Core 2 Quad CPUs were really two dual-core dies in the same package, and didn't share a last-level cache. And worse, the interconnect between dies was via the frontside bus, so was about as slow as one core accessing DRAM, even if the data didn't actually have to get written to DRAM.
But those days are long past; modern multi-core and multi-socket CPUs are about as optimized as they can be for inter-core traffic. (Zen has multiple CCXs without a package-wide shared level of cache outside that, but the interconnect between CCXs is pretty good. Grouping threads onto cores in the same CCX makes inter-core latency lower, but makes them compete for L3. So it's a tradeoff depending on how much they need to interact and wait for each other, if all cores in another CCX are idle.)
You can't really do anything special on the read side that can make anything faster.
If you had cldemote on the write side, or some other way to get data evicted back to L3, the read side could just get L3 hits. But that's only available on Tremont Atom.
x86 MESI invalidate cache line latency issue is another question about trying to get the write side to evict cache lines back to L3, this one via conflict misses.
clwb would maybe work to reduce read-side latency, but the downside is that it forces a write-back all the way to DRAM, not just L3. (And on Skylake-Xeon it does evict, like clflushopt. Hopefully Ice Lake will give us a "real" clwb.)
How to force cpu core to flush store buffer in c? is another question about basically the same thing.