OpenMP tasks - and the cost of "OpenMP if"

I am trying to understand the way OpenMP's task works.

So I started with the simplest possible test, following OpenMP 4.5's example of Fibonacci computations:

// Listing 1

#include <omp.h>
#include <stdio.h>

long fib(int n)
{
    int i, j;
    if (n<2)
        return n;
    else {
        #pragma omp task shared(i)
        i=fib(n-1);
        #pragma omp task shared(j)
        j=fib(n-2);
        #pragma omp taskwait
        return i+j;
    }
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(27);
        printf("fib(27)=%ld\n", res);
    }
}

It is clear that we will be launching a huge number of tasks here - so indeed, it is no suprise that the OpenMP version is abysmally slower than the normal one:

$ gcc -O2  fib_slow.c
$ time ./a.out
fib(27)=196418

real    0m0.003s
user    0m0.000s
sys     0m0.000s

$ gcc -O2  fib_slow.c -fopenmp
$ time ./a.out
fib(27)=196418

real    0m0.243s
user    0m0.468s
sys     0m0.080s

This test was run in a two-core VM. Notice that time reports double the user time than the real time, which means we did use the second core; but we basically wasted all the time in useless tasking work instead of the actual computation.

Fair enough - the example's text in fact warned us that this is just an example, made for educational purposes.

Since we are testing in a dual-core machine, perhaps it's simpler to use the OpenMP's "if" construct to only launch two threads at the very top level: one computing fib(N-2) and one fib(N-1).

// Listing 2

#include <omp.h>
#include <stdio.h>

long fib(int val)
{
    if (val < 2)
        return val;

    long total = 0;
    {
        #pragma omp task shared(total) if(val==45)
        total += fib(val-1);
        #pragma omp task shared(total) if(val==45)
        total += fib(val-2);
        #pragma omp taskwait
    }
    return total;
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(45);
        printf("fib(45)=%ld\n", res);
    }
}

Assuming that my understanding of "if" is correct, this should only launch two tasks at the top-level (when the input is 45) - and will therefore make better use of our two cores.

I am also bumping up the test input to 45, to make this last longer.

$ gcc -O2  fib_nice.c
$ time ./a.out
fib(45)=1134903170

real    0m8.196s
user    0m8.192s
sys     0m0.000s

$ gcc -O2  fib_nice.c -fopenmp
$ time ./a.out
fib(45)=1134903170

real    1m33.237s
user    2m33.348s
sys     0m0.012s

Oh-oh - this definitely didn't perform as I expected.

Why?

Perhaps I am using the OpenMP "if" construct wrongly (though GCC didn't tell me I did) - but to be sure, I'll take the decision to spawn a task myself:

// Listing 3

#include <omp.h>
#include <stdio.h>

long fib(int val)
{
    if (val < 2)
        return val;

    long total = 0;
    {
        if (val == 45) {
            #pragma omp task shared(total)
            total += fib(val-1);
            #pragma omp task shared(total)
            total += fib(val-2);
            #pragma omp taskwait
        } else
            return fib(val-1) + fib(val-2);
    }
    return total;
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(45);
        printf("fib(45)=%ld\n", res);
    }
}

Never mind the potential for a race on total - that's not the point; I just want to see my 2nd core do something to improve the timing.

Did it?

$ gcc -O2  fib_nicer.c
$ time ./a.out
fib(45)=1134903170

real    0m7.974s
user    0m7.968s
sys     0m0.000s

$ gcc -O2  fib_nicer.c -fopenmp
$ time ./a.out
fib(45)=1134903170

real    0m8.773s
user    0m14.300s
sys     0m0.000s

Apparently, taking the decision to spawn a task myself dramatically improved the OpenMP execution time. No idea why.

But we are still slower than single-core execution... Even though the 1st core doing fib(43) and the 2nd core doing fib(44) should have helped.

Could it be that the OpenMP #pragmas cost us time at run-time - to such an extent that they invalidate the whole endeavour?

Let's do a final experiment - in the most idiotic way possible:

// Listing 4

#include <omp.h>
#include <stdio.h>

long fib_naive(int val)
{
    if (val < 2)
        return val;
    else
        return fib_naive(val-1) + fib_naive(val-2);
}

long fib(int val)
{
    long total = 0;
    {
        #pragma omp task shared(total)
        total += fib_naive(val-1);
        #pragma omp task shared(total)
        total += fib_naive(val-2);
        #pragma omp taskwait
    }
    return total;
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(45);
        printf("fib(45)=%ld\n", res);
    }
}

This is basically manually spawning two threads. Surely, this must work...

$ gcc -O2  fib.c
$ time ./a.out
fib(45)=1134903170

real    0m8.738s
user    0m8.728s
sys     0m0.004s

$ gcc -O2  fib.c -fopenmp
$ time ./a.out
fib(45)=1134903170

real    0m5.446s
user    0m8.928s
sys     0m0.004s

And indeed, it does - we finish in 5.4 seconds, compared to the 8.7 of the single-threaded execution. I theorize that the if in Listing 3 (that spawns the top-level threads) costs a lot in the end, since it is executed for every single addition in our computation; every recursive call has to go through it.

Other than that, if you see something wrong in the steps I followed, please advise - because my takeaway so far is that OpenMP's ifs are... insanely slower than normal code ifs.

Thanks in advance for your insights/suggestions.

Solution

I opened a ticket on GCC's libgomp about this - and as you can read there, Jakub Jelinek explained that an "if(false)" in an OpenMP task is not equivalent to not having a task spawned - in fact, the data structures related to a task are created, the parent task gets suspended, and this new child task starts running immediately - with the parent resuming as soon as that is done. Needless to say, this is a lot more work than just running the code...

In addition, Jakub noted that in the non-OpenMP recursion, ttail recursion optimization takes place - something that can't happen with OpenMP, even if the "mergeable" clause was used.

Suffice to say, I learned a lot - thanks, Jakub.