c multithreading performance parallel-processing openmp

Parallel code with OpenMP takes more time to execute than serial code

I'm trying to make this code run in parallel. It's a chunk of code from a big project. I thought I would parallelize it gradually, so that I can catch problems step by step (I don't know if that's a good tactic, so please let me know).

double best_nearby(double delta[MAXVARS], double point[MAXVARS], double prevbest, int nvars)
{
    double z[MAXVARS];
    double minf, ftmp;
    int i;
    minf = prevbest;
    omp_set_num_threads(NUM_THREADS);
    
    #pragma omp parallel for shared(nvars,point,z) private(i)
    for (i = 0; i < nvars; i++)
        z[i] = point[i];
    for (i = 0; i < nvars; i++) {
        z[i] = point[i] + delta[i];
        ftmp = f(z, nvars);
        if (ftmp < minf)
            minf = ftmp;
        else {
            delta[i] = 0.0 - delta[i];
            z[i] = point[i] + delta[i];
            ftmp = f(z, nvars);
            if (ftmp < minf)
                minf = ftmp;
            else
                z[i] = point[i];
        }
    }
    for (i = 0; i < nvars; i++)
        point[i] = z[i];

    return (minf);
}

NUM_THREADS is #defined

The function has a few more lines, but they are identical in the parallel and serial versions.

The serial code takes about 130 s on average, whereas the parallel version takes around 400 s. It baffles me that such a small change can increase the execution time so much. Any ideas why this happens? Thank you in advance!

double f(double *x, int n){
double fv;
int i;

funevals++;
fv = 0.0;
for (i=0; i<n-1; i++)   /* rosenbrock */
    fv = fv + 100.0*pow((x[i+1]-x[i]*x[i]),2) + pow((x[i]-1.0),2);

return fv;
}

Solution

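Currently, you are not parallelizing much: the pragma applies only to the first loop, which merely copies point into z, so the cost of starting and synchronizing the threads on every call likely outweighs the little work being shared. You can start by parallelizing the f function instead, since it looks computationally demanding: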

double f(double *x, int n){
..
  double fv = 0.0;

  #pragma omp parallel for reduction(+:fv)
  for (int i=0; i<n-1; i++)
      fv = fv + 100.0*pow((x[i+1]-x[i]*x[i]),2) + pow((x[i]-1.0),2);

  return fv;
}

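Test and check the results. After that, you can try to widen the parallelization to include the outermost loop in best_nearby as well. Keep in mind that its iterations depend on each other through z and minf, so it cannot simply be annotated with a parallel for as it stands.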
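
To make the "test and check" step concrete, here is a minimal sketch that puts a serial and a reduction-based version of the Rosenbrock sum side by side so their results can be compared. The names rosenbrock_serial, rosenbrock_parallel and PAR_THRESHOLD are made up for this illustration and are not from the original code. The if clause and its cutoff value are an assumption on my part: the idea is to skip the thread team when n is small, and the right threshold has to be found by measuring. The squares are written as plain products instead of pow(..., 2) purely to keep the sketch free of the math library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff (not from the original post): at or below this many
   variables the loop runs on one thread, since starting a thread team
   may cost more than the loop itself. Tune it by measurement. */
#define PAR_THRESHOLD 10000

/* Serial Rosenbrock sum, kept as a reference for checking results. */
double rosenbrock_serial(const double *x, int n)
{
    double fv = 0.0;
    assert(x != NULL);
    for (int i = 0; i < n - 1; i++) {
        double a = x[i + 1] - x[i] * x[i];
        double b = x[i] - 1.0;
        fv += 100.0 * a * a + b * b;
    }
    return fv;
}

/* Same sum with an OpenMP reduction: each thread accumulates a private
   partial sum and the partial sums are added together at the end.
   Without OpenMP enabled, the pragma is ignored and the loop is serial. */
double rosenbrock_parallel(const double *x, int n)
{
    double fv = 0.0;
    assert(x != NULL);
    #pragma omp parallel for reduction(+:fv) if(n > PAR_THRESHOLD)
    for (int i = 0; i < n - 1; i++) {
        double a = x[i + 1] - x[i] * x[i];
        double b = x[i] - 1.0;
        fv += 100.0 * a * a + b * b;
    }
    return fv;
}
```

With GCC this needs -fopenmp for the pragma to take effect. Since the reduction adds the terms in a different order than the serial loop, the two results can differ in the last few digits, so compare them with a small tolerance rather than with ==. If n is small in your project (it is bounded by MAXVARS), the serial path may well remain the faster one, which is worth timing before going further.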