Tags: c++, performance, openmp

N-body algorithm: why is this slower in parallel?


I put together a sample program that mimics the type of data structure I am dealing with. Namely, I have n objects, and I need to iterate over each possible pair once and perform a (symmetric) calculation. This operation involves writing data to both elements of the pair. In serial, this would take the form of a loop like this:

for(int i = 0; i < N-1; ++i)
   for(int j = i + 1; j < N; ++j)
      ...

However, it did not take much searching on the Internet to find a "cache oblivious parallel implementation", which I wrote up and reproduced below. This post (which uses Intel TBB) describes the algorithm in detail.

I tried using OpenMP tasks to perform the same thing, and it always runs slower than the serial counterpart (simply compiling without -fopenmp). I compile it with g++ -Wall -std=c++11 -O3 test.cpp -o test. The same is observed with or without -O3; the serial is always faster.

To add a bit more information, in my real application, there are typically between a few hundred and a few thousand elements (variable n in the example below) that I need to loop over in this pair-wise fashion many times. Millions of times. My attempt below tries to simulate that (though I only tried looping 10-100k times).

I've timed this very crudely using time ./test simply because there's so much of a difference. Yes, I know my example is poorly written, and that I am including the time required to create the vector in my example. But time for serial gives me ~30 seconds and over a minute in parallel so I don't think I need to do anything more rigorous just yet.

My question is: Why does the serial version do better? Have I done something incorrect in OpenMP? How do I properly correct my mistake(s)? Have I misused tasks? I have a feeling that the recursive tasking has something to do with it; I tried setting OMP_THREAD_LIMIT to 4, but it did not make a substantial difference. Is there a better way of implementing this using OpenMP?

Note: my question is specifically asking how to fix this particular implementation so that it works properly in parallel. Though if anyone knows an alternative solution to this problem and its proper implementation in OpenMP, I am open to that as well.

#include <vector>
#include <iostream>

std::vector<std::vector<double>> testme;

// Process the rectangular block of pairs [i0,i1) x [j0,j1), splitting it
// recursively into quadrants until it is small enough to loop over directly.
void rect(int i0, int i1, int j0, int j1)
{
    int di = i1 - i0;
    int dj = j1 - j0;
    constexpr int threshold = 16;
    if(di > threshold && dj > threshold)
    {
        int im = i0 + di/2;
        int jm = j0 + dj/2;
        #pragma omp task
        { rect(i0, im, j0, jm); }
        rect(im, i1, jm, j1);
        #pragma omp taskwait
        
        #pragma omp task 
        { rect(i0, im, jm, j1); }
        rect(im, i1, j0, jm);
        #pragma omp taskwait
    }
    else
    {
        for(int i = i0; i < i1; ++i)
            for(int j = j0; j < j1; ++j) {
                testme[i][j] = 1;
                testme[j][i] = 1;
            }
                
    }
}

// Process all pairs within [n0,n1): split the range into two halves, handle
// each half recursively (in parallel), then handle the rectangle of pairs that
// cross between the two halves.
void triangle(int n0, int n1)
{
    int dn = n1 - n0;
    if(dn > 1)
    {
        int nm = n0 + dn/2;
        #pragma omp task
        { triangle(n0, nm); }
        triangle(nm, n1);
        #pragma omp taskwait

        rect(n0, nm, nm, n1);
    }
}


void calc_force(int nbodies)
{
    #pragma omp parallel num_threads(4)
    #pragma omp single
    triangle(0, nbodies);
}

int main()
{
    int n = 400;
    for(int i = 0; i < n; ++i)
    {
        std::vector<double> temp;
        for(int j = 0; j < n; ++j)
            temp.push_back(0);
        
        testme.push_back(temp);
    }

    // Repeat a bunch of times.
    for(int i = 0; i < 100000; ++i)
        calc_force(n);
        
    return 0;
}   

Solution

  • The very idea of using a recursive algorithm for such a workload already seems strange to me. And then, parallelising it with OpenMP tasks seems even stranger... Why not tackle the problem with a more conventional approach?

    So I decided to give it a go with a few methods that came to mind. But to make the exercise sensible, it was important that some actual work was done to compute the "symmetric calculation"; otherwise, just iterating over all elements without exploiting the symmetry would certainly be the best option.

    So I wrote a force() function computing something loosely related to the gravitational interaction between two bodies, based on their coordinates. Then I tested 4 different methods of iterating over the particles:

    1. A naïve triangular approach such as the one you proposed. Because of its intrinsic load imbalance, this one is parallelised with a schedule(auto) clause to let the run-time library make whatever scheduling decision it thinks best for performance.

    2. A "clever" traversal of the triangular domain, which cuts it in half in the j direction so that 2 regular loops can be used (a small check of this index mapping is sketched right after this list). Basically, it corresponds to something like this:

         /|
        / |       __  __
       /  |  =>  | //   |
      /___|      |//____|
      
    3. A straightforward rectangular approach that just ignores the symmetry. NB: this one, like your recursive approach, guarantees non-concurrent accesses to the force array.

    4. A linearised method that pre-computes the order of the i and j indexes for traversing the triangular domain, and then iterates over the vectors containing these indexes.
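
    As a quick aside on method #2: the index mapping is easy to get wrong, so here is a small standalone check (my own snippet, not part of the benchmark below) that enumerates the (ii, jj) pairs produced by the mapping and verifies that every unordered pair with ii < jj is visited exactly once:

    #include <iostream>
    #include <set>
    #include <utility>

    // Returns true if the "mangled rectangular" traversal covers each of the
    // n*(n-1)/2 unordered pairs exactly once.
    bool check_mapping( int n ) {
        std::set< std::pair<int,int> > seen;
        int m = n/2 - 1 + n%2;
        for ( int i = 0; i < n; i++ ) {
            for ( int j = 0; j < m; j++ ) {
                int ii = ( j < i ) ? n-1-i : i;    // mirrored half of the triangle
                int jj = ( j < i ) ? n-1-j : j+1;  // ii < jj holds in both branches
                seen.insert( std::make_pair( ii, jj ) );
            }
        }
        if ( n%2 == 0 ) {                          // leftover half-column for even n
            for ( int i = 0; i < n/2; i++ ) {
                seen.insert( std::make_pair( i, n/2 ) );
            }
        }
        // The loops perform exactly n*(n-1)/2 insertions, so this size test also
        // rules out duplicates.
        return (int)seen.size() == n*(n-1)/2;
    }

    int main() {
        for ( int n = 2; n <= 100; n++ ) {
            if ( !check_mapping( n ) ) {
                std::cout << "mapping misses or duplicates pairs for n = " << n << "\n";
                return 1;
            }
        }
        std::cout << "each pair visited exactly once for n = 2..100\n";
        return 0;
    }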

    Since accumulating the forces with a force[i] += fij; force[j] -= fij; approach would generate race conditions on the updates of the non-parallelised index (j, for example, in method #1), I've created a local per-thread force array, which is initialised to 0 upon entry into the parallel region. The computations are then done per thread on this "private" array, and the individual contributions are accumulated into the global force array with a critical construct before exiting the parallel region. This is a typical array reduction pattern in OpenMP (an alternative using an OpenMP 4.5 array-section reduction clause is sketched right after the full code below).

    Here is the full code for this:

    #include <iostream>
    #include <vector>
    #include <cstdlib>
    #include <cmath>
    #include <omp.h>
    
    std::vector< std::vector<double> > l_f;
    std::vector<double> x, y, f;
    std::vector<int> vi, vj;
    
    int numth, tid;
    #pragma omp threadprivate( tid )
    
    double force( int i, int j ) {
        double dx = x[i] - x[j];
        double dy = y[i] - y[j];
        double dist2 = dx*dx + dy*dy;
        return dist2 * std::sqrt( dist2 );
    }
    
    void loop1( int n ) {
        #pragma omp parallel
        {
            for ( int i = 0; i < n; i++ ) {
                l_f[tid][i] = 0;
            }
            #pragma omp for schedule(auto) nowait
            for ( int i = 0; i < n-1; i++ ) {
                for ( int j = i+1; j < n; j++ ) {
                    double fij = force( i, j );
                    l_f[tid][i] += fij;
                    l_f[tid][j] -= fij;
                }
            }
            #pragma omp critical
            for ( int i = 0; i < n; i++ ) {
                f[i] += l_f[tid][i];
            }
        }
    }
    
    void loop2( int n ) {
        int m = n/2-1+n%2;
        #pragma omp parallel
        {
            for ( int i = 0; i < n; i++ ) {
                l_f[tid][i] = 0;
            }
            #pragma omp for schedule(static) nowait
            for ( int i = 0; i < n; i++ ) { 
                for ( int j = 0; j < m; j++ ) {
                    int ii, jj;
                    if ( j < i ) {
                        ii = n-1-i;
                        jj = n-1-j;
                    }
                    else {
                        ii = i;
                        jj = j+1;
                    }
                    double fij = force( ii, jj );
                    l_f[tid][ii] += fij;
                    l_f[tid][jj] -= fij;
                }
            }
            if ( n%2 == 0 ) {
                #pragma omp for schedule(static) nowait
                for ( int i = 0; i < n/2; i++ ) {
                    double fij = force( i, n/2 );
                    l_f[tid][i] += fij;
                    l_f[tid][n/2] -= fij;
                }
            }
            #pragma omp critical
            for ( int i = 0; i < n; i++ ) {
                f[i] += l_f[tid][i];
            }
        }
    }
    
    void loop3( int n ) {
        #pragma omp parallel for schedule(static)
        for ( int i = 0; i < n; i++ ) {
            for ( int j = 0; j < n; j++ ) {
                if ( i < j ) {
                    f[i] += force( i, j );
                }
                else if ( i > j ) {
                    f[i] -= force( i, j );
                }
            }
        }
    }
    
    void loop4( int n ) {
        #pragma omp parallel
        {
            for ( int i = 0; i < n; i++ ) {
                l_f[tid][i] = 0;
            }
            #pragma omp for schedule(static) nowait
            for ( size_t k = 0; k < vi.size(); k++ ) {
                int i = vi[k];
                int j = vj[k];
                double fij = force( i, j );
                l_f[tid][i] += fij;
                l_f[tid][j] -= fij;
            }
            #pragma omp critical
            for ( int i = 0; i < n; i++ ) {
                f[i] += l_f[tid][i];
            }
        }
    }
    
    int main( int argc, char *argv[] ) {
        if ( argc != 2 ) {
            std::cout << "need the dim as argument\n";
            return 1;
        }
        int n = std::atoi( argv[1] );
    
        // initialise data
        f.resize( n );
        x.resize( n );
        y.resize( n );
        for ( int i = 0; i < n; ++i ) {
            x[i] = y[i] = i;
            f[i] = 0;
        }
    
        // initialising linearised index vectors
        for ( int i = 0; i < n-1; i++ ) {
            for ( int j = i+1; j < n; j++ ) {
                vi.push_back( i );
                vj.push_back( j );
            }
        }
        // initialising the local forces vectors
        #pragma omp parallel
        {
            tid = omp_get_thread_num();
            #pragma omp master
            numth = omp_get_num_threads();
        }
        l_f.resize( numth );
        for ( int i = 0; i < numth; i++ ) {
            l_f[i].resize( n );
        }
    
        // testing all methods one after the other, with a warm up before each
        int lmax = 10000;
        loop1( n );
        double tbeg = omp_get_wtime();
        for ( int l = 0; l < lmax; l++ ) {
            loop1( n );
        }
        double tend = omp_get_wtime();
        std::cout << "Time for triangular loop is " << tend-tbeg << "s\n";
    
        loop2( n );
        tbeg = omp_get_wtime();
        for ( int l = 0; l < lmax; l++ ) {
            loop2( n );
        }
        tend = omp_get_wtime();
        std::cout << "Time for mangled rectangular loop is " << tend-tbeg << "s\n";
    
        loop3( n );
        tbeg = omp_get_wtime();
        for ( int l = 0; l < lmax; l++ ) {
            loop3( n );
        }
        tend = omp_get_wtime();
        std::cout << "Time for naive rectangular loop is " << tend-tbeg << "s\n";
    
        loop4( n );
        tbeg = omp_get_wtime();
        for ( int l = 0; l < lmax; l++ ) {
            loop4( n );
        }
        tend = omp_get_wtime();
        std::cout << "Time for linearised loop is " << tend-tbeg << "s\n";
    
        int ret = f[n-1];
        return ret;
    }
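
    As an aside, on a compiler supporting OpenMP 4.5 array-section reductions, the per-thread arrays and the critical accumulation could be replaced by a reduction clause directly. Here is a minimal sketch of that variant for method #1 (the function name loop1_reduction and the explicit parameters are mine, for illustration only; it is not part of the benchmark above):

    #include <vector>
    #include <cmath>

    // Same placeholder interaction as force() above, duplicated so that the
    // sketch stands alone.
    static double pair_force( double xi, double yi, double xj, double yj ) {
        double dx = xi - xj;
        double dy = yi - yj;
        double dist2 = dx*dx + dy*dy;
        return dist2 * std::sqrt( dist2 );
    }

    // Triangular loop using an OpenMP 4.5 array-section reduction instead of
    // manual per-thread arrays plus a critical accumulation.
    void loop1_reduction( int n, const std::vector<double> &x,
                          const std::vector<double> &y, std::vector<double> &f ) {
        double *fp = f.data();  // the reduction clause needs an array section
        #pragma omp parallel for schedule(auto) reduction(+: fp[0:n])
        for ( int i = 0; i < n-1; i++ ) {
            for ( int j = i+1; j < n; j++ ) {
                double fij = pair_force( x[i], y[i], x[j], y[j] );
                fp[i] += fij;
                fp[j] -= fij;
            }
        }
    }

    The runtime then manages the per-thread copies of the array and their final combination, which is essentially the same pattern as the one written by hand above.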
    

    Now, it becomes simple to evaluate their relative performance and scalability. All methods are timed on a loop, after a first non-timed warm-up iteration.

    Compilation:

    g++ -O3 -mtune=native -march=native -fopenmp tbf.cc -o tbf
    

    Results on an 8-core IvyBridge CPU:

    > OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./tbf 500
    Time for triangular loop is 9.21198s
    Time for mangled rectangular loop is 10.1316s
    Time for naive rectangular loop is 15.9408s
    Time for linearised loop is 10.6449s
    > OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./tbf 500
    Time for triangular loop is 6.84671s
    Time for mangled rectangular loop is 5.13731s
    Time for naive rectangular loop is 8.09542s
    Time for linearised loop is 5.4654s
    > OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./tbf 500
    Time for triangular loop is 4.03016s
    Time for mangled rectangular loop is 2.90809s
    Time for naive rectangular loop is 4.45373s
    Time for linearised loop is 2.7733s
    > OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./tbf 500
    Time for triangular loop is 2.31051s
    Time for mangled rectangular loop is 2.05854s
    Time for naive rectangular loop is 3.03463s
    Time for linearised loop is 1.7106s
    

    So in this case, method #4 seems the best option, with both good performance and very good scalability. Notice that the straightforward triangular approach isn't too bad either, thanks to the good load-balancing job done by the schedule(auto) clause. But ultimately, I would encourage you to test with your own workload...

    For reference, your initial code (modified to compute the force() exactly the same way as in the other tests, including the number of OpenMP threads used, but without needing the local force arrays and the final reduction, thanks to the recursive approach) gives this:

    > OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./recursive 500
    Time for recursive method is 9.32888s
    > OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./recursive 500
    Time for recursive method is 9.48718s
    > OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./recursive 500
    Time for recursive method is 10.962s
    > OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./recursive 500
    Time for recursive method is 13.2786s
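
    For completeness, the modified recursive code itself is not shown, but its leaf presumably ends up looking something like the sketch below (my reconstruction, under the assumption that it accumulates the same force() quantity as the other tests; the two rect() tasks spawned concurrently work on disjoint i and j ranges, which is what makes these updates race-free without per-thread arrays):

    #include <vector>
    #include <cmath>

    // Hypothetical reconstruction of the leaf of the recursive rect(): the writes
    // to testme are replaced by the same accumulation as in loop1..loop4.
    void rect_leaf( int i0, int i1, int j0, int j1,
                    const std::vector<double> &x, const std::vector<double> &y,
                    std::vector<double> &f ) {
        for ( int i = i0; i < i1; ++i ) {
            for ( int j = j0; j < j1; ++j ) {
                double dx = x[i] - x[j];
                double dy = y[i] - y[j];
                double dist2 = dx*dx + dy*dy;
                double fij = dist2 * std::sqrt( dist2 );  // same placeholder force
                f[i] += fij;
                f[j] -= fij;
            }
        }
    }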