Tags: c++, concurrency, atomic, false-sharing

Avoiding false sharing to improve performance


#include <iostream>
#include <future>
#include <chrono>
#include <thread> // std::thread

using namespace std;
using namespace std::chrono;

int a = 0;
int padding[16]; // 64 bytes of separation, so a and b land on different cache lines
int b = 0;

promise<void> p;
shared_future<void> sf = p.get_future().share();

void func(shared_future<void> sf, int &data)
{
  sf.get(); // block until main() releases both threads at once

  auto t1 = steady_clock::now();
  while (data < 1'000'000'000)
    ++data;
  auto t2 = steady_clock::now();

  cout << duration<double>(t2 - t1).count() << endl; // elapsed seconds
}

int main()
{
  thread th1(func, sf, ref(a)), th2(func, sf, ref(b));
  p.set_value(); // start both threads simultaneously
  th1.join();
  th2.join();

  return 0;
}

I use the above code to demonstrate the impact of false sharing on performance, but to my surprise the padding doesn't seem to speed up the program at all. Interestingly, if both a and b are atomic variables, there is an obvious improvement. What's the difference?
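
For reference, the atomic variant I'm referring to looks roughly like this (a sketch; only the counter type and the parameter type change):

#include <atomic>

std::atomic<int> a{0};
int padding[16]; // same 64 bytes of separation
std::atomic<int> b{0};

void func(std::shared_future<void> sf, std::atomic<int> &data)
{
  sf.get();

  // each ++ is now an atomic read-modify-write on the shared counter
  while (data < 1'000'000'000)
    ++data;
}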


Solution

  • False sharing is most easily observed when two atomic variables in the same cache line are incremented by different threads with a read-modify-write (RMW) operation. To perform such an increment, each CPU has to flush its store buffer and lock the cache line for the duration of the operation, that is:

    • lock cache line
    • read value from L1 cache into register
    • increment value inside register
    • write back to L1 cache
    • unlock cache line

    The effect of a single cache line constantly bouncing between CPUs is noticeable, even with full compiler optimization. Forcing the two variables into different cache lines (by adding padding data) can yield a significant performance gain, since each CPU then has full access to its own cache line. Locking the cache line is still necessary, but no time is wasted acquiring read-write access to it.
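
    A minimal sketch of that padded layout using per-object alignment instead of a manual padding array (the 64-byte line size is an assumption about the target CPU; where the standard library provides it, std::hardware_destructive_interference_size from <new> is the portable constant):

    #include <atomic>

    // each counter is aligned to, and by itself fills, a 64-byte cache line
    struct alignas(64) PaddedCounter {
        std::atomic<int> value{0};
    };

    PaddedCounter a, b; // each CPU owns one line exclusively, no contention

    With this layout, the lock/read/increment/write/unlock cycle from the list above runs entirely on a line the other core never touches.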

    If both variables are plain integers, the situation is different, because incrementing an integer involves a plain load and store (i.e., not an atomic RMW operation).
    Without padding, the effects of the cache line bouncing between cores may still be noticeable, but on a much smaller scale, since no cache-line locking is involved anymore. If you compile with full optimization, the entire while-loop is probably replaced by a single increment, and there will be no difference anymore.
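
    If you want the plain-int bouncing to survive full optimization, the per-iteration load and store have to stay observable. A minimal sketch using volatile (volatile is used here only to defeat the optimizer, not as a threading tool):

    // volatile forces a real load and store on every iteration, so the
    // optimizer cannot collapse the loop into a single addition
    void func(std::shared_future<void> sf, volatile int &data)
    {
      sf.get();

      while (data < 1'000'000'000)
        data = data + 1; // ++ on a volatile operand is deprecated in C++20
    }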

    On my 4-core x86 machine, I get the following numbers:

                                                   real       user
    atomic int, no padding, no optimization      57.960s   114.495s
    atomic int, padding,    no optimization      10.514s    20.793s
    atomic int, no padding, full optimization    55.732s   110.178s
    atomic int, padding,    full optimization     8.712s    17.214s
    int,        no padding, no optimization       2.206s     4.348s
    int,        padding,    no optimization       1.951s     3.853s
    int,        no padding, full optimization     0.002s     0.000s
    int,        padding,    full optimization     0.002s     0.000s