caching memory x86 operating-system kernel

Fastest way for one core to signal another?

On an Intel CPU, I want CPU core A to signal CPU core B when A has completed an event. There are a couple ways to do this:

A sends an interrupt to B.
A writes a cache line (e.g., a bit flip) and B polls the cache line.

I want B to learn about the event with the least amount of overhead possible. Note that I am referring to overhead, not end-to-end latency. It's alright if B takes a while to learn about the event (e.g., periodic polling is fine), but B should waste as few cycles as possible detecting the event.

Option 1 above has too much overhead due to the interrupt handler. Option 2 is better, but I am still unhappy with the amount of time that B must wait for the cache line to transfer from A's L1 cache to its own L1 cache.

Is there some way A can directly push the cache line into B's L1 cache? It's fine if there is additional overhead for A in this case. I'm not sure if there some trick I can try where A marks the page as uncacheable and B marks the page as write-back...

Alternatively, is there some other mechanism built into Intel processors that can help with this?

I assume this is less of an issue on AMD CPUs as they use the MOESI coherence protocol, so the "O" should presumably allow A to broadcast the cache line changes to B.

Solution

There's disappointingly little you can do about this on x86 without some very recent ISA extensions, like cldemote (Tremont or Alder Lake / Sapphire Rapids) or user-space IPI (inter-processor interrupts) in Sapphire Rapids, and maybe also Alder Lake. (See Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions? for details on UIPI.)

Without any of those features, the choice between occasional polling (or monitor/mwait if the other core has nothing to do) vs. interrupt depends on how many times you expect to poll before you want to send a notification. (And how much potential throughput you'll lose due to any knock-on effects from the other thread not noticing the flag update soon, e.g. if that means larger buffers leading to more cache misses.)

In user-space, other than shared memory or UIPI, the alternatives are OS-delivered inter-process-communications like a signal or a pipe write or eventfd; the Linux UIPI benchmarks compared it to various mechanisms for latency and throughput IIRC.

AMD CPUs don't broadcast stores; that would swamp the interconnect with traffic and defeat the benefit of private L1d cache for lines that get repeatedly written (between accesses from other cores, even if it avoided it for lines that weren't recently shared.)