I understand that in symmetric multiprocessor (SMP) systems, false sharing may occur because each core has its own cache, as in the following code from http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads:
double sum = 0.0, sum_local[NUM_THREADS];
#pragma omp parallel num_threads(NUM_THREADS)
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;   /* neighbouring elements can share a cache line */

    #pragma omp for
    for (int i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];   /* each thread writes only its own slot */

    #pragma omp atomic
    sum += sum_local[me];
}
So my questions are:
The problem is not that the cache line is read or written as a unit; it is that when any CPU writes any byte in a cache line, the whole line is invalidated in the other cores' caches.
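To make that concrete, here is a minimal sketch of the usual workaround: pad each thread's accumulator so that no two threads write to the same cache line. The 64-byte line size, the `padded_double` type, and the `dot_padded` function are assumptions for illustration, not part of the Intel article; check the actual line size on your hardware. The `#ifdef _OPENMP` fallback is only there so the sketch still builds (serially) without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch still compiles without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define NUM_THREADS 4
#define CACHE_LINE  64   /* ASSUMED cache line size in bytes */

/* Hypothetical helper type: one accumulator per CACHE_LINE bytes. */
typedef struct {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];
} padded_double;

/* Dot product with per-thread partial sums spaced CACHE_LINE bytes apart,
 * so two threads' accumulators never land in the same line. */
double dot_padded(const double *x, const double *y, int n)
{
    double sum = 0.0;
    padded_double sum_local[NUM_THREADS];

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = omp_get_thread_num();
        sum_local[me].value = 0.0;

        #pragma omp for
        for (int i = 0; i < n; i++)
            sum_local[me].value += x[i] * y[i];

        /* One atomic update per thread, after its loop chunk is done. */
        #pragma omp atomic
        sum += sum_local[me].value;
    }
    return sum;
}
```

Calling `dot_padded(x, y, N)` gives the same result as the original loop; only the memory layout of the partial sums changes. Accumulating into a plain local `double` inside the parallel region, or using OpenMP's `reduction(+:sum)` clause, avoids the shared array altogether and is often simpler.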
It depends on the GPU. At least on some NVIDIA GPUs, the L1 cache (for global memory) is not coherent, so you have other problems. With L1 disabled, you can still run into the problem in the L2 cache, which is coherent.