CPU cache inhibition

Say I have the defacto standard x86 CPU with 3 level of Caches, L1/L2 private, and L3 shared among cores. Is there a way to allocate shared memory whose data will not be cached on L1/L2 private caches, but rather it will only be cached at L3? I don't want to fetch data from memory (that's too costly), but I'd like to experiment with performance with and without bringing the shared data into private caches.

The assumption is that L3 is shared among the cores (presumably physically indexed cache) and thus will not incur any false sharing or cache line invalidation for heavily used shared data.

Any solution (if it exists) would have to be done programmatically, using C and/or assembly for intel based CPUs (relatively modern Xeon architectures (skylake, broadwell), running linux based OS.

Edit:

I have latency sensitive code which uses a form of shared memory for synchronization. The data will be in L3, but when read or written to it will go into L1/L2 depending on cache inclusivity policy. By implication of the problem, the data will have to be invalidated adding an unnecessary (I think) performance hit. I'd like to see if it's possible to just store the data, either through some page policy or special instructions only in L3.

I know it's possible to use the special memory register to inhibit caching for security reasons, but that requires CPL0 privilege.

Edit2:

I'm dealing with parallel codes that run on high performance systems for months at at time. The systems are high core-count systems (eg. 40-160+ cores) that periodically perform synchronization which needs to execute in usecs.

Solution

x86 has no way to do a store that bypasses or writes through L1D/L2 but stops at L3. There are NT stores which bypass all cache. Anything that forces a write-back to L3 also forces write-back all the way to memory. (e.g. a clwb instruction). Those are designed for non-volatile RAM use cases, or for non-coherent DMA, where it's important to get data committed to actual RAM.

(Update: Tremont / Sapphire Rapids have cldemote. Earlier hardware runs it as a NOP, so it's usable as a hint.)

There's also no way to do a load that bypasses L1D (except from WC memory like video RAM with SSE4.1 movntdqa, but it's not "special" on other memory types). prefetchNTA can bypass L2, according to Intel's optimization manual. (And bypass L3 on Xeons with non-inclusive L3 cache.)

Prefetch on the core doing the read should be useful to trigger write-back from other core into L3, and transfer into your own L1D. But that's only useful if you have the address ready before you want to do the load. (Dozens to hundreds of cycles for it to be useful.)

Intel CPUs use a shared inclusive L3 cache as a backstop for on-chip cache coherency. 2-socket has to snoop the other socket, but Xeons that support more than 2P have snoop filters to track cache lines that move around. (This is describing up to Broadwell Xeon v4, not the redesign for Skylake and later Xeon Scalable.)

When you read a line that was recently written by another core, it's always Invalid in your L1D. L3 is tag-inclusive, and its tags have extra info to track which core has the line. (This is true even if the line is in M state in an L1D somewhere, which requires it to be Invalid in L3, according to normal MESI.) Thus, after your cache-miss checks L3 tags, it triggers a request to the L1 that has the line to write it back to L3 cache (and maybe to send it directly to the core than wants it).

Skylake-X (Skylake-AVX512) doesn't have an inclusive L3 (It has a bigger private L2 and a smaller L3), but it still has a tag-inclusive structure to track which core has a line. It also uses a mesh instead of ring, and L3 latency seems to be significantly worse than Broadwell. (Especially in first-gen Skylake; I think less bad in Ice Lake and later Xeons.)

Possibly useful: map the latency-critical part of your shared memory region with a write-through cache policy. IDK if this patch ever made it into the mainline Linux kernel, but see this patch from HP: Support Write-Through mapping on x86. (The normal policy is WB.)

Also related: Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer, an in-depth look at latency and bandwidth on 2-socket SnB, for cache lines in different starting states.

For more about memory bandwidth on Intel CPUs, see Enhanced REP MOVSB for memcpy, especially the Latency Bound Platforms section. (Having only 10 LFBs limits single-core bandwidth).

Related: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? has some experimental results for having one thread spam writes to a location while another thread reads it.

Note that the cache miss itself isn't the only effect. You also get a lot of machine_clears.memory_ordering from mis-speculation in the core doing the load. (x86's memory model is strongly ordered, but real CPUs speculatively load early and abort in the rare case where the cache line becomes invalid before the load was supposed to have "happened".

Also related:

Is there any way to write for Intel CPU direct core-to-core communication code? (no, except IPI interrupts)
Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions? - UIPI exists now, in Sapphire Rapids: user-space inter-processor interrupts.