
Compiler optimization eliminates effects of false sharing. How?


I'm trying to replicate the effects of false sharing using OpenMP as explained in the OpenMP introduction by Tim Mattson.

My program performs a straightforward numerical integration (see the link for the mathematical details), and I've implemented two versions. The first is supposed to be cache-friendly: each thread keeps a local variable in which it accumulates its portion of the index space:

const auto num_slices = 100000000; 
const auto num_threads = 4;  // Swept from 1 to 9 threads
const auto slice_thickness = 1.0 / num_slices;
const auto slices_per_thread = num_slices / num_threads;

std::vector<double> partial_sums(num_threads);

#pragma omp parallel num_threads(num_threads)
{
  double local_buffer = 0;
  const auto thread_num = omp_get_thread_num();
  for(auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
    local_buffer += func(slice * slice_thickness); // <-- Updates thread-exclusive buffer
  partial_sums[thread_num] = local_buffer; 
}
// Sum up partial_sums to receive final result
// ...

The second version has each thread update its element of the shared std::vector<double> directly, so that each write invalidates that cache line in every other core's cache:

// ... as above
#pragma omp parallel num_threads(num_threads)
{
  const auto thread_num = omp_get_thread_num();
  for(auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
    partial_sums[thread_num] += func(slice * slice_thickness); // <-- Invalidates caches
}
// Sum up partial_sums to receive final result
// ...

The problem is that I am unable to see any effects of false sharing whatsoever unless I turn off optimization.

[Plot: mean runtime of both versions over 1 to 9 threads, compiled at -O0 and -O3]

Compiling my code (which has to account for a few more details than the snippets above) with GCC 8.1 and no optimization (-O0) yields the results I naively expected, while full optimization (-O3) eliminates any performance difference between the two versions, as shown in the plot.

What's the explanation for this? Does the compiler actually eliminate false sharing? If not, how come the effect is so small when running the optimized code?

I'm on a Core i7 machine running Fedora. The plot displays mean values; the sample standard deviations are small enough that they add no information to this question.


Solution

  • tl;dr: The compiler optimizes your second version into the first.

    Consider the code within the loop of your second implementation - ignoring the OMP/multithreaded aspect of it for a moment.

    You have increments of a value within an std::vector - whose elements necessarily live on the heap (well, up to and including C++17, anyway). The compiler sees you're adding to a value on the heap within a loop; this is a typical candidate for optimization: it takes the heap access out of the loop and uses a register as a buffer. It doesn't even need to keep reading from the heap, since these are plain additions - so it essentially arrives at your first solution.
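
    Schematically, the promotion looks roughly like this (a hand-written sketch of the as-if transformation, not actual compiler output):

    double acc = partial_sums[thread_num];   // at most one load before the loop
    for(auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
      acc += func(slice * slice_thickness);  // all accumulation stays in a register
    partial_sums[thread_num] = acc;          // single store after the loop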

    See this happening on GodBolt (with a simplified example) - notice how the code for bar1() and bar2() is almost the same, with accumulation happening in registers.
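
    A minimal pair along those lines (my own reconstruction, not necessarily the exact code from that example) is easy to rebuild in Compiler Explorer:

    #include <vector>

    // Defined inline so the compiler can prove it never touches the vector.
    static double func(double x) { return 4.0 / (1.0 + x * x); }

    // Accumulates directly into heap storage, like the second version.
    double bar1(std::vector<double>& v, long n) {
      for(long i = 0; i < n; ++i)
        v[0] += func(i * 1e-8);
      return v[0];
    }

    // Accumulates in a local variable, like the first version.
    double bar2(std::vector<double>& v, long n) {
      double local = 0;
      for(long i = 0; i < n; ++i)
        local += func(i * 1e-8);
      v[0] += local;
      return v[0];
    }

    At -O3 both functions should compile to essentially the same register-accumulation loop, touching v[0] only outside it.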

    Now, the fact that there's multi-threading and OMP involved doesn't change the above. If you were to use, say, std::atomic<double> instead of double, then it might change things (and maybe not even then, if the compiler is smart enough).
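
    A sketch of that variant (note that fetch_add on a floating-point std::atomic only exists since C++20; earlier standards would need a compare-exchange loop):

    std::vector<std::atomic<double>> atomic_sums(num_threads);  // hypothetical replacement for partial_sums; needs <atomic>

    #pragma omp parallel num_threads(num_threads)
    {
      const auto thread_num = omp_get_thread_num();
      for(auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
        // An atomic read-modify-write cannot be buffered in a register,
        // so every iteration really does hit the shared cache line.
        atomic_sums[thread_num].fetch_add(func(slice * slice_thickness));
    }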


    Notes:

    • Thanks go to @Evg for noticing a glaring mistake in the code of a previous version of this answer.
    • The compiler must be able to know that func() won't also change the value of your vector - or to decide that, for the purposes of addition, it shouldn't really matter. (See the sketch after these notes.)
    • This optimization could be seen as a Strength Reduction - from an operation on the heap to one on a register - though I'm not sure that term is in use for this case; compilers more commonly describe it as register promotion (or scalar replacement).
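
    To illustrate the second note: with GCC or Clang you can assert that property yourself via the (non-standard) pure attribute, which promises the optimizer that func() never writes to memory - a sketch, assuming func() is defined in another translation unit:

    // GCC/Clang-specific: func() reads but never writes global state, so
    // the optimizer may keep partial_sums[thread_num] in a register
    // across the calls even without seeing func()'s definition.
    [[gnu::pure]] double func(double x);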