I am trying to understand how OpenMP's task construct works.
So I started with the simplest possible test, following the Fibonacci example from the OpenMP 4.5 examples document:
// Listing 1
#include <omp.h>
#include <stdio.h>

long fib(int n)
{
    int i, j;
    if (n < 2)
        return n;
    else {
        #pragma omp task shared(i)
        i = fib(n-1);
        #pragma omp task shared(j)
        j = fib(n-2);
        #pragma omp taskwait
        return i + j;
    }
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(27);
        printf("fib(27)=%ld\n", res);
    }
}
It is clear that we will be launching a huge number of tasks here - so indeed, it is no surprise that the OpenMP version is abysmally slower than the plain one:
$ gcc -O2 fib_slow.c
$ time ./a.out
fib(27)=196418
real 0m0.003s
user 0m0.000s
sys 0m0.000s
$ gcc -O2 fib_slow.c -fopenmp
$ time ./a.out
fib(27)=196418
real 0m0.243s
user 0m0.468s
sys 0m0.080s
This test was run in a two-core VM. Notice that time reports roughly twice as much user time as real time, which means we did use the second core; but we basically wasted all of it on tasking overhead instead of the actual computation.
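To get a sense of the scale involved: each call with n >= 2 spawns two tasks, so the task total follows the same recurrence as fib itself. A small counting helper (my own illustration, not part of the original code) shows that fib(27) creates over 600,000 tasks:

    #include <stdio.h>

    /* Counts the tasks Listing 1 creates for fib(n): every invocation
       with n >= 2 spawns two child tasks, one per recursive branch. */
    static long task_count(int n)
    {
        if (n < 2)
            return 0;
        return 2 + task_count(n - 1) + task_count(n - 2);
    }

    int main(void)
    {
        printf("tasks for fib(27): %ld\n", task_count(27));  /* prints 635620 */
        return 0;
    }

That is hundreds of thousands of task creations for a computation that amounts to a few hundred thousand additions - no wonder the overhead dominates.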
Fair enough - the example's text did in fact warn us that it is meant for educational purposes only.
Since we are testing on a dual-core machine, perhaps it's simpler to use OpenMP's if clause to spawn only two tasks, at the very top level: one computing fib(N-1) and one computing fib(N-2).
// Listing 2
#include <omp.h>
#include <stdio.h>

long fib(int val)
{
    if (val < 2)
        return val;
    long total = 0;
    {
        #pragma omp task shared(total) if(val==45)
        total += fib(val-1);
        #pragma omp task shared(total) if(val==45)
        total += fib(val-2);
        #pragma omp taskwait
    }
    return total;
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(45);
        printf("fib(45)=%ld\n", res);
    }
}
Assuming my understanding of the if clause is correct, this should spawn only two tasks, at the top level (when the input is 45), and will therefore make better use of our two cores.
I am also bumping the input up to 45, to make the run last longer.
$ gcc -O2 fib_nice.c
$ time ./a.out
fib(45)=1134903170
real 0m8.196s
user 0m8.192s
sys 0m0.000s
$ gcc -O2 fib_nice.c -fopenmp
$ time ./a.out
fib(45)=1134903170
real 1m33.237s
user 2m33.348s
sys 0m0.012s
Uh-oh - this definitely didn't perform as I expected.
Why?
Perhaps I am using OpenMP's if clause wrongly (though GCC didn't complain) - but to be sure, I'll make the spawning decision myself, with a plain C if:
// Listing 3
#include <omp.h>
#include <stdio.h>

long fib(int val)
{
    if (val < 2)
        return val;
    long total = 0;
    {
        if (val == 45) {
            #pragma omp task shared(total)
            total += fib(val-1);
            #pragma omp task shared(total)
            total += fib(val-2);
            #pragma omp taskwait
        } else
            return fib(val-1) + fib(val-2);
    }
    return total;
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(45);
        printf("fib(45)=%ld\n", res);
    }
}
Never mind the potential race on total - that's not the point; I just want to see my second core do something that improves the timing.
Did it?
$ gcc -O2 fib_nicer.c
$ time ./a.out
fib(45)=1134903170
real 0m7.974s
user 0m7.968s
sys 0m0.000s
$ gcc -O2 fib_nicer.c -fopenmp
$ time ./a.out
fib(45)=1134903170
real 0m8.773s
user 0m14.300s
sys 0m0.000s
Apparently, making the spawning decision myself dramatically improved the OpenMP execution time. No idea why, yet.
But we are still slower than the single-core run... even though one core computing fib(44) and the other fib(43) should have helped.
Could it be that the OpenMP #pragmas cost us so much at run time that they invalidate the whole endeavour?
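One way to probe that cost directly would be a micro-benchmark that times plain function calls against the same calls wrapped in tasks carrying an if(0) clause. This is a sketch under my own assumptions (loop count and timer choice are illustrative), to be compiled with -fopenmp:

    #include <omp.h>
    #include <stdio.h>

    static volatile long sink;  /* keeps the compiler from removing the calls */

    static void tick(void) { ++sink; }

    /* n plain function calls */
    static double time_plain_calls(long n)
    {
        double t0 = omp_get_wtime();
        for (long i = 0; i < n; ++i)
            tick();
        return omp_get_wtime() - t0;
    }

    /* n "disabled" tasks: same work, plus the runtime's per-task bookkeeping */
    static double time_disabled_tasks(long n)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel
        #pragma omp single
        for (long i = 0; i < n; ++i) {
            #pragma omp task if(0)
            tick();
        }
        return omp_get_wtime() - t0;
    }

    int main(void)
    {
        const long N = 10 * 1000 * 1000;
        printf("plain calls : %.3fs\n", time_plain_calls(N));
        printf("if(0) tasks : %.3fs\n", time_disabled_tasks(N));
        return 0;
    }

If the pragmas were free, the two timings would match; any gap is pure task-machinery overhead.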
Let's do a final experiment - in the most idiotic way possible:
// Listing 4
#include <omp.h>
#include <stdio.h>

long fib_naive(int val)
{
    if (val < 2)
        return val;
    else
        return fib_naive(val-1) + fib_naive(val-2);
}

long fib(int val)
{
    long total = 0;
    {
        #pragma omp task shared(total)
        total += fib_naive(val-1);
        #pragma omp task shared(total)
        total += fib_naive(val-2);
        #pragma omp taskwait
    }
    return total;
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(45);
        printf("fib(45)=%ld\n", res);
    }
}
This basically spawns just two tasks by hand, one per top-level branch. Surely, this must work...
$ gcc -O2 fib.c
$ time ./a.out
fib(45)=1134903170
real 0m8.738s
user 0m8.728s
sys 0m0.004s
$ gcc -O2 fib.c -fopenmp
$ time ./a.out
fib(45)=1134903170
real 0m5.446s
user 0m8.928s
sys 0m0.004s
And indeed, it does - we finish in 5.4 seconds, compared to the 8.7 of the single-threaded run. I theorize that the if in Listing 3 (the one guarding the top-level task spawns) costs a lot in the end, since it is executed for every single addition in our computation; every recursive call has to go through it.
Other than that, if you see something wrong in the steps I followed, please advise - because my takeaway so far is that OpenMP's if clause is... insanely slower than a plain C if.
Thanks in advance for your insights/suggestions.
I opened a ticket on GCC's libgomp about this - and as you can read there, Jakub Jelinek explained that an if(false) clause on an OpenMP task is not equivalent to not spawning a task at all: the data structures for the task are still created, the parent task gets suspended, and the new child task starts running immediately, with the parent resuming only once the child is done. Needless to say, this is a lot more work than just calling the code...
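In other words, even a "disabled" task goes through the full task machinery. A toy model of that sequence (illustrative only - the names and structures are mine, not libgomp's actual internals):

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy stand-in for the runtime's task descriptor. */
    struct toy_task {
        void (*fn)(void *);
        void *data;
    };

    /* What "task if(0)" conceptually costs: allocate a descriptor,
       suspend the parent, run the child to completion, clean up. */
    static void run_undeferred(void (*fn)(void *), void *data)
    {
        struct toy_task *t = malloc(sizeof *t);   /* bookkeeping... */
        t->fn = fn;
        t->data = data;
        t->fn(t->data);                           /* child runs immediately */
        free(t);                                  /* ...more bookkeeping */
    }

    static void add_one(void *p) { ++*(long *)p; }

    int main(void)
    {
        long total = 0;
        run_undeferred(add_one, &total);  /* the "task if(0)" path */
        add_one(&total);                  /* what the serial code does */
        printf("total=%ld\n", total);     /* prints "total=2" */
        return 0;
    }

Both paths do the same arithmetic, but the first pays for allocation and hand-off on every single call - exactly what our recursion was drowning in.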
In addition, Jakub noted that in the non-OpenMP recursion, tail-recursion optimization takes place - something that cannot happen with OpenMP tasks, even if the mergeable clause were used.
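To illustrate what that optimization buys: the compiler can turn the second recursive call of the plain fib into a loop iteration, so each level of the recursion pays for only one real call. A hand-written sketch of that transformation (my own illustration, not the compiler's actual output):

    #include <stdio.h>

    /* Plain fib with the tail call fib(val-2) rewritten as a loop:
       the accumulator collects each fib(n-1) term, and n steps down
       by two in the same stack frame instead of recursing. */
    static long fib_tco(int n)
    {
        long acc = 0;
        while (n >= 2) {
            acc += fib_tco(n - 1);  /* the only remaining true call */
            n -= 2;                 /* fib(n-2) continues in this frame */
        }
        return acc + n;             /* base cases: fib(0)=0, fib(1)=1 */
    }

    int main(void)
    {
        printf("fib(27)=%ld\n", fib_tco(27));  /* prints "fib(27)=196418" */
        return 0;
    }

With task pragmas wrapped around the recursive calls, the call sites are no longer tail calls at all, so this rewrite is off the table.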
Suffice to say, I learned a lot - thanks, Jakub.