Why do we even need cache coherence?

In languages like C, unsynchronized reads and writes to the same memory location from different threads is undefined behavior. But in the CPU, cache coherence says that if one core writes to a memory location and later another core reads it, the other core has to read the written value.

Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away? Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?

Solution

The acquire and release semantics required for C++11 std::mutex (and equivalents in other languages, and earlier stuff like pthread_mutex) would be very expensive to implement if you didn't have coherent cache. You'd have to write-back every dirty line every time you released a lock, and evict every clean line every time you acquired a lock, if couldn't count on the hardware to make your stores visible, and to make your loads not take stale data from a private cache.

But with cache coherency, acquire and release are just a matter of ordering this core's accesses to its own private cache which is part of the same coherency domain as the L1d caches of other cores. So they're local operations and pretty cheap, not even needing to drain the store buffer. The cost of a mutex is just in the atomic RMW operation it needs to do, and of course in cache misses if the last core to own the mutex wasn't this one.

C11 and C++11 added stdatomic and std::atomic respectively, which make it well-defined to access shared _Atomic int variables, so it's not true that higher level languages don't expose this. It would hypothetically be possible to implement on a machine that required explicit flushes/invalidates to make stores visible to other cores, but that would be very slow. The language model assumes coherent caches, not providing explicit flushes of ranges but instead having release operations that make every older store visible to other threads that do an acquire load that syncs-with the release store in this thread. (See When to use volatile with multi threading? for some discussion, although that answer is mainly debunking the misconception that caches could have stale data, from people mixed up by the fact that the compiler can "cache" non-atomic non-volatile values in registers.)

In fact, some of the guarantees on C++ atomic are actually described by the standard as exposing HW coherence guarantees to software, like "write-read coherence" and so on, ending with the note:

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note

(Long before C11 and C++11, SMP kernels and some user-space multithreaded programs were hand-rolling atomic operations, using the same hardware support that C11 and C++11 finally exposed in a portable way.)

Also, as pointed out in comments, coherent cache is essential for writes to different parts of the same line by other cores to not step on each other.

ISO C11 guarantees that a char arr[16] can have arr[0] written by one thread while another writes arr[1]. If those are both in the same cache line, and two conflicting dirty copies of the line exist, only one can "win" and be written back. C++ memory model and race conditions on char arrays

ISO C effectively requires char to be as large as smallest unit you can write without disturbing surrounding bytes. On almost all machines (not early Alpha and not some DSPs), that's a single byte, even if a byte store might take an extra cycle to commit to L1d cache vs. an aligned word on some non-x86 ISAs.

The language didn't officially require this until C11, but that just standardized what "everyone knew" the only sane choice had to be, i.e. how compilers and hardware already worked.