Looking for a good way to use openMP with recursive code?

I wanted to optimize recursion with openMP. So, I started from this question: best-way-to-parallelize-this-recursion-using-openmp

While looking to optimize recursive functions I was first interested in openMP. Starting from the code below that I found there (https://en.cppreference.com/w/cpp/chrono) :

#include <iostream>
#include <chrono>
 
long fibonacci(unsigned n)
{
    if (n < 2) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Compiled with g++ without any optimization it gives the result:

f(42) = 267914296
elapsed time: 1.88232s

Then I used openMP similarly to the answer upside... but I had to stop the process because it took a very very long time. I do not know a lot about openMP recursion parallelism since I just use it to optimize for loops in my code... Also, I found something in gcc documentation :

__attributes__((const))

Then I added it to my openMP "optimized" version but all I obtained is a Core Dump !!!

So I removed my omp pragmas to check that the core dump is due to my code or something else...

Then the code became:

#include <iostream>
#include <chrono>

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
   int long r[2];

   if (n < 2) return n;
   r[0]=fibonacci(n-1);
   r[1]=fibonacci(n-2);
   return (r[0]+r[1]);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

I compiled it with the following command:

g++ fibonacci.cpp -o fibonacci -Wall -O2 -march=native

Now it take a lot less time to perform the fibonacci computation of 42:

f(42) = 267914296
elapsed time: 0.00106504s

in power save mode and in performance mode it gives

f(42) = 267914296
elapsed time: 0.000187806s

Here a lscpu of my computer (if somebody want to compare results):

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU(s) scaling MHz:  95%
    CPU max MHz:         4500,0000
    CPU min MHz:         800,0000
    BogoMIPS:            8403,00

Now, I do not know how much time it could take with openMP because I was not able to obtain a satisfactory result... I hope somebody will find a way to add omp pragmas to this kind of recursive cases.

As required, here is the omp version 1 of this case:

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
    int long a, b;

    if (n < 2) return n;
    #pragma omp parallel
    #pragma omp single nowait
    {         
       #pragma omp task shared(a)
       a=fibonacci(n-1);
       #pragma omp task shared(b)
       b=fibonacci(n-2);
       #pragma omp taskwait
    }
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

And the version that drive me to a "silly code":

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
   int long r[2];
   int i;
   if (n < 2) return n;
   #pragma omp parallel
   #pragma omp single
   for(i=0;i<2;++i) 
   {
       #pragma omp task shared(r)   
       r[i]=fibonacci(n-i-1);
   }      
    
    return (r[0]+r[1]);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    std::cout << "f(42) = " << fibonacci(42) << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

The build command for these omp versions is :

g++ omp_fibonacci_for.cpp -o omp_fibonacci_for -Wall -fopenmp

Both of them seems to loop "for ever" (fortunately Ctrl-C worked fine). But, while writing the end of this question, the second version with "for", process successfully:

f(42) = 267914296
elapsed time: 480.953s

To be clear I'm not looking for "best perf", my objective is to try to obtain a omp version that process in a reasonable time with the goal of finding the way to use openMP on recursive code. Honestly I was expecting a better execution time than 480s ! I was also surprised by the time it took in the "poor debug" version that took less 0.001s to process...

The thing is now I'd like to know if the way I use openMP is acceptable if I apply this model with "havier" tasks or if I do something wrong.

Solution

Perfectible solution

So, after taken into consideration all comments and answers, this new code seems to be a better way to apply openMP pragmas on a recursive code:

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{
   
    int long a, b;

    if (n < 2) return n;
    #pragma omp task shared(a)
    a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
    #pragma omp taskwait
    
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    long res;
    #pragma omp parallel
    {
        #pragma omp single
        res = fibonacci(42);
    }
    std::cout << "f(42) = " << res << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Build command:

g++ omp_fibonacci.cpp -o omp_fibonacci -Wall -fopenmp

Output:

f(42) = 267914296
elapsed time: 255.225s

@DanielLangr comments where particularly useful.

candidate solution

This is the best I obtained with the use of openMP.

And, with @JerryCoffin answer and @ThomasMatthiew comments and the advice of @VictorEijkhout... I though about parallelization differently and it drive me to this code:

#include <iostream>
#include <chrono>
#include <omp.h>

__attribute__ ((const))
long fibonacci(unsigned n)
{ 
    int long a, b;
    if (n < 2) return n;
        a=fibonacci(n-2);
        b=fibonacci(n-1);
    return (a+b);
}

long omp_fibonacci(unsigned n)
{
    int long a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)
    a=fibonacci(n-1);
    #pragma omp task shared(b)
    b=fibonacci(n-2);
    #pragma omp taskwait
    
    return (a+b);
}
 
int main()
{
    auto start = std::chrono::steady_clock::now();
    long res;
    #pragma omp parallel 
    {
        #pragma omp single
        res = omp_fibonacci(42);
    }
    std::cout << "f(42) = " << res << '\n';
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

Outputs:

f(42) = 267914296
elapsed time: 1.00274s