I am trying to optimize the performance of a parallel for loop that has a reduction variable (called delta), and I am wondering how this is handled under the hood by the OpenMP library.
Let's take the following piece of code as an example, where I simply declare the variable as a reduction variable at the beginning of the loop:
#pragma omp parallel shared(delta, A, B, rows, colms) private(i, j)
.
.
.
#pragma omp for reduction(+:delta)
for (i=1; i<=rows; i++){
    for (j=1; j<=colms; j++){
        delta += fabs(A[i][j] - B[i][j]);
    }
}
.
.
.
//end of parallel region
I am wondering whether each thread sets a lock when accessing the delta variable during the calculation, and whether I could increase performance by replacing delta with an array delta[number_of_threads], where each thread writes to a different position of the array during the calculation and all the elements are summed up after the parallel region.
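To make the idea concrete, this is roughly what I have in mind (just a sketch; MAX_THREADS and delta_partial are placeholder names, and MAX_THREADS would be an upper bound on the thread count):

double delta_partial[MAX_THREADS] = {0.0};  // one slot per thread, zero-initialized

#pragma omp parallel shared(delta_partial, A, B, rows, colms) private(i, j)
{
    int tid = omp_get_thread_num();
    #pragma omp for
    for (i=1; i<=rows; i++){
        for (j=1; j<=colms; j++){
            delta_partial[tid] += fabs(A[i][j] - B[i][j]);
        }
    }
}
// after the parallel region: sum the per-thread partial results
for (int t = 0; t < MAX_THREADS; t++)
    delta += delta_partial[t];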
Each thread will have its own private copy of delta (local_delta in the pseudo-code below) on its stack frame:
#pragma omp parallel shared(delta, A, B, rows, colms) private(i, j)
{
    int lb, ub;                   // this thread's loop bounds
    double local_delta = 0.0;     // one private copy per thread
    __omp_init_schedule(1, rows, &lb, &ub);  // compute this thread's chunk of the iteration space
    for (i = lb; i <= ub; i++) {
        for (j = 1; j <= colms; j++) {
            local_delta += fabs(A[i][j] - B[i][j]);
        }
    }
    __omp_reduce(&delta, local_delta);  // accumulate this thread's delta into the shared var
    __omp_barrier();                    // the implicit barrier of the for construct
}
Please take the above as pseudo-code. The actual code pattern will depend on the implementation, inlining, and all sorts of other optimizations an OpenMP implementation might do. If you want to read a bit about how things work, please have a look at [1] and [2].
The implementation of __omp_reduce() can be either tree-based or sequential, using either locks or atomic instructions. OpenMP implementations are usually rather smart and choose the right algorithm for the machine and/or the number of threads being used.
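For illustration, a sequential, lock-free variant could look roughly like the following (again pseudo-code in the spirit of the snippet above; __omp_reduce is not a real, portable runtime entry point):

#include <stdatomic.h>

// Illustrative sketch only: a lock-free sequential accumulation using a
// compare-and-swap loop. A real runtime may instead take a lock, use a
// hardware atomic add, or combine partial results in a tree.
static void __omp_reduce(_Atomic double *shared_delta, double local_delta)
{
    double expected = atomic_load(shared_delta);
    double desired;
    do {
        desired = expected + local_delta;
    } while (!atomic_compare_exchange_weak(shared_delta, &expected, desired));
}

A tree-based scheme instead has pairs of threads combine their partial results, halving the number of participants in each step, which tends to scale better for large thread counts.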
Doing the delta[numthreads] modification will likely reduce performance by more than 100x, as this is the textbook example of false sharing: delta[0] for thread 0 and delta[1] for thread 1 will sit in the same cache line, and this causes a lot of traffic on the cache and the memory. A better approach would be to introduce padding, e.g. delta[numthreads * 8] (assuming that delta is 8 bytes), so that each thread gets its own cache line. However, you then still need to perform the final aggregation, and the OpenMP implementation will likely still do a better job.
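If you do want to experiment with the manual approach, a padded version could look roughly like this (a sketch only; compute_delta and padded_double are made-up names, 64 bytes is an assumed cache-line size, and calloc does not guarantee cache-line alignment of the base address):

#include <math.h>
#include <stdlib.h>
#include <omp.h>

// Each partial sum sits in its own 64-byte slot so that two threads never
// write to the same cache line (assuming a cache-line-aligned allocation).
typedef struct { double value; char pad[64 - sizeof(double)]; } padded_double;

double compute_delta(double **A, double **B, int rows, int colms)
{
    int nthreads = omp_get_max_threads();
    padded_double *partial = calloc(nthreads, sizeof(padded_double));
    double delta = 0.0;

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 1; i <= rows; i++)
            for (int j = 1; j <= colms; j++)
                partial[t].value += fabs(A[i][j] - B[i][j]);
    }

    // the final aggregation that reduction(+:delta) would otherwise do for you
    for (int t = 0; t < nthreads; t++)
        delta += partial[t].value;

    free(partial);
    return delta;
}

Even then, the built-in reduction(+:delta) is simpler and usually at least as fast, so the manual version is mostly useful for studying the effect, not for production code.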
[2] https://www.dontknow.de/openmp-stuff/thunk-you-very-much-or-how-do-openmp-compilers-work-part-2/