Sequential program cache consistency

I have a question about cache consistency. If I have a sequential program, I shouldn't have cache consistency problems, because the instructions execute one after another, so there is no danger of several processors writing to the same memory location at the same time, even if the memory is shared.

A parallel program is a different case: it runs on multiple processors, so there is a high probability of cache consistency problems.

Is that right?

Solution

In a single-threaded program, execution does not move to another thread on its own unless you program it to; only the OS can switch it. When the OS does, the same thread state is reloaded from memory into that cache, so coherence is not a problem there.

In a multi-threaded program, an update to a variable that is also held in other caches has to be communicated to those caches somehow. This causes data to flow again through all the other caches. It may not block the updating thread itself, but as soon as the user needs to see only up-to-date values, the synchronization/locking takes a performance hit, especially when other variables at very close addresses, in the same cache line, are being updated too. That's why an array of locks with 20-byte elements performs worse than one with 128-byte elements: the smaller elements put several locks in the same cache line.
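The cache-line effect can be sketched in C++. This is only an illustration under stated assumptions: the names `PackedCounter`, `PaddedCounter`, `run_counters`, `kThreads` and `kLineSize` are hypothetical, plain counters stand in for the locks, and the 128-byte figure is taken from the text above rather than queried from the hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 128 bytes is enough to give each element its own cache line.
constexpr std::size_t kLineSize = 128;
constexpr int kThreads = 4;

// Packed layout: neighbouring counters can land in the same cache line,
// so a write by one thread invalidates the line in the other cores' caches.
struct PackedCounter {
    std::atomic<long> value{0};
};

// Padded layout: alignas rounds sizeof up to kLineSize,
// so every counter occupies a cache line of its own.
struct alignas(kLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kLineSize,
              "each padded counter should fill exactly one slot");

// Each thread increments only its own counter; returns the sum of all counters.
template <typename Counter>
long run_counters(long increments) {
    assert(increments >= 0);
    Counter counters[kThreads];
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&counters, t, increments] {
            for (long i = 0; i < increments; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    long total = 0;
    for (auto& c : counters) {
        total += c.value.load();
    }
    return total;
}
```

Both `run_counters<PackedCounter>(n)` and `run_counters<PaddedCounter>(n)` return the same total; only the amount of coherence traffic differs. Timing the two calls with a large `n` would typically show the padded version running faster, though the size of the gap depends on the CPU.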

If CPUs did not have coherence, multi-threading wouldn't work efficiently. So some designs broadcast each update to all caches (as in snooping caches). But this does not scale to a high number of cores: with 1000 cores in one CPU, it would require 1000-way broadcast logic that consumes a lot of circuit area. So designers break the problem into smaller parts and add other mechanisms, such as directory-based coherence and splitting the cores into multiple groups. But this adds latency to coherence.
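The difference between broadcasting and a directory can be illustrated with a toy message count. This is a deliberately simplified model, not a real protocol: `snoop_invalidations` and `DirectoryEntry` are hypothetical names, only one cache line is tracked, and the extra lookup at the directory (the added latency mentioned above) is not modelled.

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t kCores = 1000;

// Snooping model: a write is announced to every other cache,
// whether or not that cache holds a copy of the line.
std::size_t snoop_invalidations(std::size_t cores) {
    return cores - 1;
}

// Directory model: one entry records which cores hold a copy of a line.
struct DirectoryEntry {
    std::bitset<kCores> sharers;

    // A read by `core` registers it as a sharer.
    void read(std::size_t core) { sharers.set(core); }

    // A write by `core` invalidates only the other recorded sharers
    // and returns how many invalidation messages that took.
    std::size_t write(std::size_t core) {
        std::size_t messages = sharers.count() - (sharers.test(core) ? 1 : 0);
        sharers.reset();
        sharers.set(core);
        return messages;
    }
};
```

With three cores sharing a line, a write by one of them costs two invalidations in the directory model, while the snooping model contacts all 999 other caches regardless of who holds the line.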

On the other hand, many GPUs do not implement automatic cache coherence because:

- the algorithm supplied by the developer is generally embarrassingly parallel, with only a few synchronization points, and blocks of threads do not need to communicate with other blocks (when they do, they go through a common cache via the developer's choice of instructions anyway)
- there are thousands of streaming pipelines (not real cores) that just need to issue memory requests efficiently; otherwise there wouldn't be enough space for that many pipelines
- high throughput is required rather than low latency (no need for implicit coherence anywhere)

So the multiprocessors in a GPU are designed to do work completely independently of each other, and adding automatic coherence would add little performance (if it didn't reduce it). When a developer needs to synchronize data between threads in the same block, there are instructions for this, and without them an update is not guaranteed to become visible to the other threads. So cache coherence on a GPU is effectively optional.