
Optimize C for-loop performance with OpenMP


I have a simple for-loop that computes each entry of an array[n] from the corresponding row of an array X[n][d].

double *function(double *X, int n, int d) {
    double *array = calloc(n, sizeof(double));
    int i, j;
    //#pragma omp parallel
    {
        //#pragma omp parallel for if(n>15000)
        for (i = 0; i < n - 1; i++)
        {
            //#pragma omp parallel for shared(X, j, i) reduction(+: sum)
            //#pragma omp parallel for if(d>100) reduction(+:distances[:n]) private(j)
            for (j = 0; j < d; j++)
            {
                array[i] += pow(X[(j+1)*n-1] - X[j*n+i], 2);
            }
            array[i] = sqrt(array[i]);
        }
    }
    return array;
}

Consider n to be as high as n=100000, while d can have a predefined value from d=2 to d=100. function() is called multiple times (2^k times at the kth iteration). So the pattern is: at the first iteration it is called once, at the second iteration it is called twice, at the third iteration it is called four times, and so on. Also, n decreases by one at every iteration (n -= 1).

I have tried different combinations of the OpenMP directives that I have put as comments in the sample code, but no matter what I try, the code performs equally well or better without the OpenMP directives.

What are some good ways/techniques to improve the runtime of the above loop using OpenMP?


Solution

  • It is hard to tell without anything to test against, but I would try something like this:

    double* function( double* X, int n, int d ) {
        double *array = malloc( n * sizeof( double ) );
        #pragma omp parallel for schedule( static )
        for( int i = 0 ; i < n - 1; i++ ) {
            double sum = 0;
        for( int j = 0; j < d; j++ ) {
            double dist = X[( j + 1 ) * n - 1] - X[j * n + i];
            sum += dist * dist;
        }
            array[i] = sqrt( sum );
        }
        return array;
    }
    

    I'm not sure it will be any more effective than your code, but it has a few improvements that should have an impact on performance:

    1. To avoid initializing the array to 0 in the sequential part, and also to allow for hopefully better optimization by the compiler, I replaced the calloc() with a plain malloc() and used a local variable sum for accumulating the partial sums.
    2. I've put the parallelization pragma as far out as possible to maximize parallelism.
    3. I've used a local dist variable to store the temporary distance between the two current values of X, and just multiplied it by itself, avoiding a costly call to the much more complex pow() function.

    Now, depending on the result you get from that, you could consider adding the same sort of if( n > NMIN ) clause as you had initially on the #pragma omp parallel for directive. The value of this NMIN would be for you to determine according to your measured performance. Another possible direction for optimization would be to place the parallel directive outside of the function, which would spare you numerous thread starts and stops. However, you'd then have to add a #pragma omp single before the call to malloc() and remove the parallel from the existing directive.
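
    To illustrate that last idea, here is a hedged sketch (the function and caller names are illustrative, not from the original post, and it assumes the same X layout as in the question): the worker uses a plain omp for instead of parallel for, and a single with copyprivate so that every thread ends up holding the same freshly allocated pointer:

    ```c
    #include <stdlib.h>
    #include <math.h>

    // Illustrative name; designed to be called from inside an enclosing
    // parallel region. Note the plain "omp for" (no "parallel") and the
    // "single" around the allocation, whose result is broadcast to all
    // threads' private copies of "array" via copyprivate.
    double *compute_distances(double *X, int n, int d) {
        double *array = NULL;
        #pragma omp single copyprivate(array)
        array = malloc(n * sizeof(double));

        #pragma omp for schedule(static)
        for (int i = 0; i < n - 1; i++) {
            double sum = 0;
            for (int j = 0; j < d; j++) {
                double dist = X[(j + 1) * n - 1] - X[j * n + i];
                sum += dist * dist;
            }
            array[i] = sqrt(sum);
        }
        return array;   // same pointer in every thread
    }

    // The caller opens one parallel region around the call(s), so the
    // thread team is started once rather than once per invocation.
    void caller(double *X, int n, int d) {
        #pragma omp parallel
        {
            double *result = compute_distances(X, n, d);
            // the implicit barrier at the end of "omp for" guarantees the
            // loop has finished before any thread reaches this point
            #pragma omp single
            free(result);
        }
    }
    ```

    Compile with -fopenmp; if OpenMP is disabled, the pragmas are ignored and the code simply runs serially, which makes it easy to check correctness.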