I have a simple for-loop that calculates an array[n] where each element depends on the corresponding row of an array X[n][d].
double *function(double *X, int n, int d) {
    double *array = calloc(n, sizeof(double));
    //#pragma omp parallel
    {
        //#pragma omp parallel for if(n>15000)
        for (int i = 0; i < n - 1; i++) {
            //#pragma omp parallel for shared(X, j, i) reduction(+:sum)
            //#pragma omp parallel for if(d>100) reduction(+:array[:n]) private(j)
            for (int j = 0; j < d; j++) {
                array[i] += pow(X[(j + 1) * n - 1] - X[j * n + i], 2);
            }
            array[i] = sqrt(array[i]);
        }
    }
    return array;
}
Consider n to be as high as n=100000, and d to have a predefined value anywhere from d=2 to d=100. The function() is called multiple times, 2^k times at the kth iteration: at the first iteration it is called once, at the second iteration it is called twice, at the third iteration four times, and so on. Also, n diminishes by one in every iteration (n -= 1).
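In pseudocode, the calling pattern looks roughly like this (k_max is just a placeholder for the total number of iterations, it is not a real variable in my code):

for (int k = 0; k < k_max; k++) {          // k_max iterations in total
    for (long c = 0; c < (1L << k); c++) { // 2^k calls at iteration k
        double *result = function(X, n, d);
        /* ... use result ... */
        free(result);
    }
    n -= 1;                                // n shrinks by one each iteration
}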
I have tried different combinations of the OpenMP directives that I have put as comments in the sample code, but no matter what I have tried, the code performs equally well or better without the OpenMP directives.
What are some good ways/techniques to improve the time performance of the above loop using OpenMP?
It is hard to tell without something to test against, but I would try something like this:
#include <stdlib.h>
#include <math.h>

double* function( double* X, int n, int d ) {
    double *array = malloc( n * sizeof( double ) );
    #pragma omp parallel for schedule( static )
    for( int i = 0 ; i < n - 1; i++ ) {
        double sum = 0;    /* thread-private accumulator, no reduction needed */
        for( int j = 0; j < d; j++ ) {
            /* square the difference directly instead of calling pow() */
            double dist = X[( j + 1 ) * n - 1] - X[j * n + i];
            sum += dist * dist;
        }
        array[i] = sqrt( sum );
    }
    return array;
}
I'm not sure it will be any more effective than your code, but it has a few improvements which should have an impact on performance:

- I replaced the calloc() by a plain malloc() and used a local variable sum for accumulating the partial sums.
- I introduced a dist variable to store the temporary distance between the 2 current values of X, and just multiplied it by itself, avoiding a costly call to the much more complex pow() function.

Now, depending on the result you get from that, you could consider adding the same sort of if( n > NMIN ) clause as you had initially on the #pragma omp parallel for directive. The value for this NMIN would be for you to determine according to your measured performance.
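For instance (NMIN here is a placeholder; the value shown is only the one from your original code, not a recommendation):

#define NMIN 15000   /* threshold to tune from your own measurements */

#pragma omp parallel for schedule( static ) if( n > NMIN )
for( int i = 0 ; i < n - 1; i++ ) {
    /* same loop body as above */
}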
Another possible direction for optimization would be to place the parallel directive outside of the function. That would spare you numerous thread starts/stops. However, you'd have to add a #pragma omp single before the call to malloc() and remove the parallel from the existing directive.
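A minimal sketch of that variant, assuming every thread of an enclosing parallel region calls the function. Note that the copyprivate clause, which broadcasts the pointer allocated inside the single construct to the other threads, is my addition here:

double* function( double* X, int n, int d ) {
    double *array;
    /* one thread allocates; copyprivate broadcasts the pointer to all threads */
    #pragma omp single copyprivate( array )
    array = malloc( n * sizeof( double ) );

    /* orphaned worksharing construct: binds to the enclosing parallel region */
    #pragma omp for schedule( static )
    for( int i = 0 ; i < n - 1; i++ ) {
        double sum = 0;
        for( int j = 0; j < d; j++ ) {
            double dist = X[( j + 1 ) * n - 1] - X[j * n + i];
            sum += dist * dist;
        }
        array[i] = sqrt( sum );
    }
    return array;
}

And on the caller's side:

#pragma omp parallel
{
    double *result = function( X, n, d );  /* every thread gets the same pointer */
    /* ... */
}

Since only one copy of the array is allocated, make sure it is freed only once, outside the parallel region.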