Tags: c#, clr, atomic, cpu-architecture, coreclr

C# How volatile and interlocked affect cpu cache


I'm trying to understand how volatile reads/writes and interlocked operations in C# affect the processor cache.

  1. I have read in some places that both of those operations flush the processor cache. I would like to know whether that is true.

  2. If they flush the cache, how do they do it? For example, which assembly instructions are used?

  3. If they flush the cache, does the whole cache of the executing CPU get flushed, or just the cache line?

  4. If they flush the cache, how would it affect the caches of the other processors? Do they invalidate the other CPUs' caches, so that code executing on the other processors gets the updated values with non-volatile reads?

I could not find much information online about this. I am not that good with assembly language, so I could not figure it out by myself. Appreciate any thoughts on this.


Solution

  • Cache is coherent across all the CPU cores that we run multiple threads on, and doesn't need to be flushed for other cores to see your stores (see this). https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ is also fairly good. So there's no reason for atomic operations to ever flush a cache line they operate on.


    1. is false. Interlocked operations flush the store buffer to cache and are a full barrier for memory ordering, but they do not evict the line from cache afterwards. A repeated lock add [mem], 1 (x86 InterlockedAdd) can keep hitting in cache; see the sketch below.
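
    For concreteness, here's a minimal C# sketch (mine, not part of the original answer) of that kind of repeated atomic RMW. The per-operation cost is the store-buffer drain and full barrier, not any cache flush; uncontended repeats can keep hitting in this core's L1:

        using System;
        using System.Threading;
        using System.Threading.Tasks;

        class InterlockedCounter
        {
            static long _count;

            static void Main()
            {
                // Four threads each do one million atomic increments.
                // Each Interlocked.Increment is a lock'd RMW on x86: it drains
                // the store buffer and is a full memory barrier, but the cache
                // line stays in this core's cache (Modified state) between
                // uncontended operations, so repeats can hit in L1.
                Task[] tasks = new Task[4];
                for (int i = 0; i < tasks.Length; i++)
                {
                    tasks[i] = Task.Run(() =>
                    {
                        for (int j = 0; j < 1_000_000; j++)
                            Interlocked.Increment(ref _count);
                    });
                }
                Task.WaitAll(tasks);
                Console.WriteLine(_count); // always 4000000: no lost updates
            }
        }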

    (volatile stores are similar, but without being an RMW. Volatile loads have no effect on cache beyond what a plain load would have. I don't know if C# volatile stores flush the store buffer, or if they're just release (not seq_cst) ordering. Most of the rest of this answer applies to volatile as well as Interlocked, but I was thinking of just atomic RMWs when I wrote it. A sketch of volatile's release/acquire semantics follows.)
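
    As an illustration of those volatile semantics (again my sketch, not the answerer's): per the C# spec, a volatile write has release ordering and a volatile read has acquire ordering, which is enough for the classic flag handoff without any cache flushing:

        using System;
        using System.Threading;

        class VolatileHandoff
        {
            static int _data;
            static volatile bool _ready; // C# volatile: release store / acquire load

            static void Main()
            {
                var consumer = new Thread(() =>
                {
                    while (!_ready) { }       // volatile load: acquire ordering
                    Console.WriteLine(_data); // guaranteed to print 42
                });
                consumer.Start();

                _data = 42;    // plain store
                _ready = true; // volatile store: release ordering, so the _data
                               // store can't be reordered to after it. Nothing
                               // here flushes or invalidates a cache; MESI
                               // coherence delivers the new values to the
                               // consumer's core.
                consumer.Join();
            }
        }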

    After an Interlocked operation, the line will be in MESI Modified state in this core's cache, which means it will be Invalid in all other cores' private caches. (Although if a request for the line from another core arrived while that atomic RMW was keeping it owned by this core, it might change MESI state to Shared or Invalid the moment the RMW finishes, but that would be due to the request from the other core being handled, not due to the Interlocked operation itself.)

    (Microsoft's "Interlocked*" functions are, AFAIK, just wrappers (on x86) for instructions with a lock prefix, so everything you read about those instructions applies, including blocking compile-time reordering around the operation even if it inlines. If C# Interlocked operations aren't exactly like the Windows C functions of the same name, this answer might not be fully correct, but I expect they'd follow the same design for the same-named functions, because an atomic RMW with seq_cst ordering is a pretty standard thing to expose.)
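
    To make that mapping concrete, here's a hypothetical spinlock sketch built on Interlocked.CompareExchange. The comments about which x86 instructions the JIT emits reflect the usual lock cmpxchg / lock xchg lowering, stated as an assumption rather than verified against a particular runtime:

        using System.Threading;

        static class SpinLockSketch
        {
            static int _held; // 0 = free, 1 = held

            public static void Enter()
            {
                // Typically JITs to lock cmpxchg on x86: an atomic
                // compare-and-swap that is also a full memory barrier.
                while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
                    Thread.SpinWait(16); // back off while another core owns the line
            }

            public static void Exit()
            {
                // Typically lock xchg on x86; a full barrier, so the critical
                // section's stores are visible before the lock reads as free.
                Interlocked.Exchange(ref _held, 0);
            }
        }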

    See https://preshing.com/20120930/weak-vs-strong-memory-models/ - x86 has a "strong" memory model, and Interlocked functions on weakly-ordered machines are atomic RMW + full barrier just like on x86. (But it takes multiple instructions to make that happen, with the barriers separate from the atomic RMW.) Like C++ memory_order_seq_cst operations.

    For more about store buffers and why CPUs need them, see Can a speculatively executed CPU branch contain opcodes that access RAM?. Store buffers naturally create StoreLoad reordering (the only kind x86 allows, and the most expensive kind to block).
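
    To show what StoreLoad reordering looks like from C#, here's a Dekker-style sketch (my example), with Thread.MemoryBarrier standing in for the full/seq_cst fence that Interlocked operations also provide:

        using System;
        using System.Threading;

        class StoreLoadDemo
        {
            static int _x, _y, _r1, _r2;

            static void Main()
            {
                for (int i = 0; i < 100_000; i++)
                {
                    _x = 0; _y = 0;
                    var t1 = new Thread(() =>
                    {
                        _x = 1;
                        Thread.MemoryBarrier(); // full barrier: drains the store buffer
                        _r1 = _y;               // without the barrier this load could
                                                // run before _x = 1 is globally visible
                    });
                    var t2 = new Thread(() =>
                    {
                        _y = 1;
                        Thread.MemoryBarrier();
                        _r2 = _x;
                    });
                    t1.Start(); t2.Start();
                    t1.Join(); t2.Join();
                    // With the barriers, _r1 == 0 && _r2 == 0 is impossible.
                    // Remove the two MemoryBarrier calls (or make the fields
                    // volatile instead) and this line can fire: that's
                    // StoreLoad reordering, which release/acquire volatile
                    // does not block.
                    if (_r1 == 0 && _r2 == 0)
                        Console.WriteLine($"StoreLoad reordering at iteration {i}");
                }
            }
        }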