Search code examples
c++openmp

C++ OpenMP Fibonacci: 1 thread performs much faster than 4 threads


I'm trying to understand why the following runs much faster on 1 thread than on 4 threads on OpenMP. The following code is actually based on a similar question: OpenMP recursive tasks but when trying to implement one of the suggested answers, I don't get the intended speedup, which suggests I've done something wrong (and not sure what it is). Do people get better speed when running the below on 4 threads than on 1 thread? I'm getting a 10 times slowdown when running on 4 cores (I should be getting moderate speedup rather than significant slowdown).

int fib(int n)
  {
    if(n == 0 || n == 1)
        return n;
    if (n < 20) //EDITED CODE TO INCLUDE CUTOFF
        return fib(n-1)+fib(n-2); 
    int res, a, b;
    #pragma omp task shared(a)
    a = fib(n-1);
    #pragma omp task shared(b)
    b = fib(n-2);
    #pragma omp taskwait
    res = a+b;
    return res;
  }

int main(){
  omp_set_nested(1);
  omp_set_num_threads(4);
  double start_time = omp_get_wtime();
  #pragma omp parallel
  {
    #pragma omp single
    {
      cout << fib(25) << endl;
    }
  }
  double time = omp_get_wtime() - start_time;
  std::cout << "Time(ms): " << time*1000 << std::endl;
  return 0;
}

Solution

  • Have you tried it with a large number?

    In multi-threading, it takes some time to initialize work on CPU cores. For smaller jobs, which is done very fast on a single core, threading slows the job down because of this.

    Multi-threading shows increase in speed if the job normally takes time longer than second, not milliseconds.

    There is also another bottleneck for threading. If your codes try to create too many threads, mostly by recursive methods, this may cause a delay to all running threads causing a massive set back.

    In this OpenMP/Tasks wiki page, it is mentioned and a manual cut off is suggested. There need to be 2 versions of the function and when the thread goes too deep, it continues the recursion with single threading.

    EDIT: cutoff variable needs to be increased before entering OMP zone.


    the following code is for test purposes for the OP to test

    #define CUTOFF 5
    int fib_s(int n)
    {
        if (n == 0 || n == 1)
            return n;
        int res, a, b;
        a = fib_s(n - 1);
        b = fib_s(n - 2);
        res = a + b;
        return res;
    }
    int fib_m(int n,int co)
    {
        if (co >= CUTOFF) return fib_s(n);
        if (n == 0 || n == 1)
            return n;
        int res, a, b;
        co++;
    #pragma omp task shared(a)
        a = fib_m(n - 1,co);
    #pragma omp task shared(b)
        b = fib_m(n - 2,co);
    #pragma omp taskwait
        res = a + b;
        return res;
    }
    
    int main()
    {
        omp_set_nested(1);
        omp_set_num_threads(4);
        double start_time = omp_get_wtime();
    #pragma omp parallel
        {
    #pragma omp single
            {
                cout << fib_m(25,1) << endl;
            }
        }
        double time = omp_get_wtime() - start_time;
        std::cout << "Time(ms): " << time * 1000 << std::endl;
        return 0;
    }
    

    RESULT: With CUTOFF value set to 10, it was under 8 seconds to calculate 45th term.

    co=1   14.5s
    co=2    9.5s
    co=3    6.4s
    co=10   7.5s
    co=15   7.0s 
    co=20   8.5s
    co=21 >18.0s
    co=22 >40.0s