The sample code below is a simplified version of my working code. In this code, the only write to a shared variable happens on the last line, where std::vector::push_back is called.
std::vector<struct FortyByteStruct> results;
#pragma omp parallel for num_threads(8)
for (int i = 0; i < 250; i++)
{
    struct FortyByteStruct result = some_heavy_work(i);
#pragma omp critical
    {
        results.push_back(result);
    }
}
I was wondering if this push_back operation could result in false sharing, which would give me a chance to optimize further by getting rid of it. I decided to run some benchmarks first, before digging into the issue.
With chrono, I measured the wall-clock execution time of some_heavy_work() and of the critical section separately. The latter took about 10^(-4) times as long as the former, so I concluded that there would be almost no benefit from optimizing this part, whether false sharing is involved or not.
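For reference, the measurement looked roughly like this (a simplified sketch; accumulating the per-thread times through an OpenMP reduction is my illustration, not the exact code):

#include <chrono>

std::vector<struct FortyByteStruct> results;
double work_seconds = 0.0;      // total time spent in some_heavy_work()
double critical_seconds = 0.0;  // total time spent in (and waiting for) the critical section
#pragma omp parallel for num_threads(8) reduction(+ : work_seconds, critical_seconds)
for (int i = 0; i < 250; i++)
{
    auto t0 = std::chrono::steady_clock::now();
    struct FortyByteStruct result = some_heavy_work(i);
    auto t1 = std::chrono::steady_clock::now();
#pragma omp critical
    {
        results.push_back(result);
    }
    auto t2 = std::chrono::steady_clock::now();
    work_seconds     += std::chrono::duration<double>(t1 - t0).count();
    critical_seconds += std::chrono::duration<double>(t2 - t1).count();
}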
Anyway, I'm still curious whether false sharing is an issue here. Do I have to look at the internal implementation of std::vector? Any enlightenment would be greatly appreciated. (I'm on VS2015.)
Given that your FortyByteStruct is probably smaller than a cache line (usually 64 bytes), there may be some false sharing when writing the results data. However, it is hardly consequential, because it will be overshadowed by the cost of the critical section, and also by the "true" sharing that comes from modifying the vector itself (not its data). You don't need to know the details of std::vector's implementation, only that its data is contiguous in memory and that its state (pointer(s) to data/size/capacity) lives in the memory of the vector variable itself. False sharing is usually an issue when separate data on the same cache line is accessed by multiple threads in an unprotected fashion. Keep in mind that false sharing does not affect correctness, only performance.
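To make that distinction concrete, a small sketch (the variable name is mine):

std::vector<struct FortyByteStruct> v(4);
// &v       -> the vector object itself (pointer(s)/size/capacity), on the stack here
// v.data() -> the separate, contiguous heap block holding the four elements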
A slightly different example of false sharing would be a std::vector<std::vector<struct FortyByteStruct>> where each thread performs an unprotected push_back on its own inner vector. I explained that in detail here.
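For illustration, that pattern might look like the following sketch (the per_thread name and the fixed thread count are assumptions for the example). Each thread owns its inner vector, so no lock is needed, but the small inner vector objects lie next to each other in memory and several of them can end up on the same cache line, so their unprotected updates may false-share:

#include <omp.h>

std::vector<std::vector<struct FortyByteStruct>> per_thread(8);
#pragma omp parallel num_threads(8)
{
    int tid = omp_get_thread_num();
#pragma omp for
    for (int i = 0; i < 250; i++)
    {
        // Updates per_thread[tid]'s pointer/size/capacity, which may share a
        // cache line with the neighboring threads' vector objects.
        per_thread[tid].push_back(some_heavy_work(i));
    }
}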
In your example, with the total size of the vector known in advance, the best approach would be to resize the vector before the loop and then just assign results[i] = result. This avoids the critical section, and OpenMP typically distributes the loop iterations in such a way that there is little false sharing. You also get a deterministic order of results.
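Based on the loop from your question, a sketch of that approach:

// Pre-size the vector once; each iteration then writes its own element,
// so no critical section is needed.
std::vector<struct FortyByteStruct> results(250);
#pragma omp parallel for num_threads(8)
for (int i = 0; i < 250; i++)
{
    results[i] = some_heavy_work(i);
}

With the typical static schedule, each thread gets a contiguous chunk of iterations, so at most the few elements at chunk boundaries can share a cache line.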
That said, once you have confirmed by measurement that the time is dominated by some_heavy_work, you are fine.