Tags: c++, concurrency, atomic, false-sharing

Avoiding false sharing to improve performance


#include <iostream>
#include <future>
#include <chrono>
#include <thread> // std::thread

using namespace std;
using namespace std::chrono;

int a = 0;
int padding[16]; // 64 bytes of separation, so a and b land on different cache lines
int b = 0;

promise<void> p;
shared_future<void> sf = p.get_future().share();

void func(shared_future<void> sf, int &data)
{
  sf.get(); // block until main() releases both threads at once

  auto t1 = steady_clock::now();
  while (data < 1'000'000'000)
    ++data;
  auto t2 = steady_clock::now();

  cout << duration<double>(t2 - t1).count() << endl; // elapsed seconds
}

int main()
{
  thread th1(func, sf, ref(a)), th2(func, sf, ref(b));
  p.set_value(); // start both threads simultaneously
  th1.join();
  th2.join();

  return 0;
}

I use the above code to demonstrate the impact of false sharing on performance, but to my surprise the padding doesn't seem to speed up the program at all. Interestingly, if both a and b are atomic variables, there is an obvious improvement. What's the difference?
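
For reference, the atomic variant I'm referring to looks roughly like this (a sketch; only the counter type and the parameter type change):

#include <atomic>

std::atomic<int> a{0};
int padding[16]; // same 64 bytes of separation
std::atomic<int> b{0};

void func(std::shared_future<void> sf, std::atomic<int> &data)
{
  sf.get();

  // each ++ is now an atomic read-modify-write on the shared counter
  while (data < 1'000'000'000)
    ++data;
}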


Solution

  • False sharing is most easily observed when two atomic variables in the same cache line are incremented by different threads with a read-modify-write (RMW) operation. To perform such an increment, each CPU has to flush its store buffer and lock the cache line for the duration of the operation, that is:

    • lock cache line
    • read value from L1 cache into register
    • increment value inside register
    • write back to L1 cache
    • unlock cache line

    The effect of a single cache line constantly bouncing between CPUs is noticeable, even with full compiler optimization. Forcing the two variables into different cache lines (by adding padding data) can yield a significant performance gain, since each CPU then has full access to its own cache line. Locking the cache line is still necessary, but no time is wasted acquiring read-write access to it.
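
    A minimal sketch of that padded layout using per-object alignment instead of a manual padding array (the 64-byte line size is an assumption about the target CPU; where the standard library provides it, std::hardware_destructive_interference_size from <new> is the portable constant):

    #include <atomic>

    // each counter is aligned to, and by itself fills, a 64-byte cache line
    struct alignas(64) PaddedCounter {
        std::atomic<int> value{0};
    };

    PaddedCounter a, b; // each CPU owns one line exclusively, no contention

    With this layout, the lock/read/increment/write/unlock cycle from the list above runs entirely on a line the other core never touches.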

    If both variables are plain integers, the situation is different, because incrementing an integer involves a plain load and store (i.e., not an atomic RMW operation).
    Without padding, the effects of the cache line bouncing between cores may still be noticeable, but on a much smaller scale, since no cache-line locking is involved anymore. If you compile with full optimization, the entire while-loop is probably replaced by a single increment, and there will be no difference anymore.
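
    If you want the plain-int bouncing to survive full optimization, the per-iteration load and store have to stay observable. A minimal sketch using volatile (volatile is used here only to defeat the optimizer, not as a threading tool):

    // volatile forces a real load and store on every iteration, so the
    // optimizer cannot collapse the loop into a single addition
    void func(std::shared_future<void> sf, volatile int &data)
    {
      sf.get();

      while (data < 1'000'000'000)
        data = data + 1; // ++ on a volatile operand is deprecated in C++20
    }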

    On my 4-core x86 machine, I get the following numbers:

                                                   real       user
    atomic int, no padding, no optimization      57.960s   114.495s
    atomic int, padding,    no optimization      10.514s    20.793s
    atomic int, no padding, full optimization    55.732s   110.178s
    atomic int, padding,    full optimization     8.712s    17.214s
    int,        no padding, no optimization       2.206s     4.348s
    int,        padding,    no optimization       1.951s     3.853s
    int,        no padding, full optimization     0.002s     0.000s
    int,        padding,    full optimization     0.002s     0.000s