My OpenMP code gets very slow when I add (*pRandomTrial)++; after generating a random number. In g_iRandomTrials[32] I store the number of rand() calls made by each thread. Each thread writes to a different index of this array, so there are no race conditions and the results are correct, but this very simple counter makes the program almost 10 times slower than without it. Is there some keyword I can use in this case? I tried some setups with firstprivate(g_iRandomTrials), but I was never successful. If I create a local int counter in Simulate() and use the pointer only twice, at the start and at the end of the function, the code will probably run much faster, but this seems like a somewhat ugly solution, as it doesn't address the underlying problem...
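int g_iRandomTrials[32];   /* one rand() call counter per thread */
...
#pragma omp parallel
{
    do
    {
        ...
        Simulate();
        ...
    } while (...);
}

void Simulate(void)
{
    ...
    int id = omp_get_thread_num();
    int *pRandomTrial = g_iRandomTrials + id;   /* this thread's counter */
    ...
    while (used[index])
    {
        index = rand() % 50;
        (*pRandomTrial)++;   /* adding this line causes the slowdown */
    }
}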
The reason for the slowdown is called false sharing. The answer is padding.
In computer science, false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but that data shares a cache block with data that is altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource.
https://en.wikipedia.org/wiki/False_sharing
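CPU caches move memory around in fixed-size blocks called cache lines. These tend to be 64 bytes in length. When one core writes to a variable, the cache coherence protocol gives that core exclusive ownership of the entire cache line and invalidates the copies held by the other cores. Any other core that touches the same line, even a different variable on it, has to wait and fetch the line again. Assuming 4-byte ints, 16 of your counters fit on one 64-byte line, so the threads keep taking the line away from each other on every increment.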
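To make the arithmetic concrete, here is a small sketch that computes which cache line an element of an int array lands on. The 64-byte line size, the assumption that the array starts on a line boundary, and the helper names cache_line_of and counters_per_line are all illustrative, not something from your code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. Check the value for your target CPU. */
#define CACHE_LINE_SIZE 64

static_assert(CACHE_LINE_SIZE % sizeof(int) == 0,
              "int size must divide the cache line size evenly");

/* Hypothetical helper: index of the cache line (counted from the start of
   the array) that holds element `index` of an int array, assuming the
   array itself begins on a cache line boundary. */
static size_t cache_line_of(size_t index)
{
    return (index * sizeof(int)) / CACHE_LINE_SIZE;
}

/* Hypothetical helper: how many int counters fit on one cache line. */
static size_t counters_per_line(void)
{
    return CACHE_LINE_SIZE / sizeof(int);
}
```

Any two threads whose ids give the same cache_line_of() value are incrementing counters on the same line, which is exactly the situation that triggers false sharing.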
The answer is to pad and align the elements of g_iRandomTrials in such a way that no value is within 64 bytes of another. Keep in mind that 64 bytes is the most common cache line size, but there are architectures where it differs.
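A minimal sketch of what the padding could look like in C11, assuming 64-byte cache lines and at most 32 threads as in the question. The names PaddedCounter, g_paddedTrials, count_trial and total_trials are made up for illustration; the thread id is passed in so the snippet compiles without OpenMP.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption: typical, but not universal */
#define MAX_THREADS     32   /* same size as g_iRandomTrials in the question */

/* One counter per cache line: the int plus enough padding to fill the line. */
typedef struct {
    int  count;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
} PaddedCounter;

static_assert(sizeof(PaddedCounter) == CACHE_LINE_SIZE,
              "each counter must occupy exactly one cache line");

/* Align the array so element 0 starts on a cache line boundary. */
alignas(CACHE_LINE_SIZE) static PaddedCounter g_paddedTrials[MAX_THREADS];

/* Called by a thread after each rand(); thread_id would come from
   omp_get_thread_num() in the original program. */
static void count_trial(int thread_id)
{
    g_paddedTrials[thread_id].count++;
}

/* Sum the per-thread counters once the parallel region has finished. */
static long total_trials(void)
{
    long total = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        total += g_paddedTrials[i].count;
    return total;
}
```

In Simulate() you would then point pRandomTrial at &g_paddedTrials[id].count instead of g_iRandomTrials + id. The cost is memory: 32 counters now take 2 KB instead of 128 bytes, which is usually an acceptable trade.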