I am solving a problem that compares execution times between a serial, an MPI, and an OpenMP code. The problem is that the OpenMP version is slower than the MPI one. Is there a way to improve the OpenMP code below so that it becomes faster than MPI?
for(i=0;i<loop;i++)
{
    #pragma omp parallel for private(k,dx,dy,dz,d,a) schedule(dynamic)
    for(j=0;j<N;j++)
    {
        for(k=0;k<N;k++)
        {
            if(j!=k)
            {
                dx=C[k*3+0]-C[j*3+0];
                dy=C[k*3+1]-C[j*3+1];
                dz=C[k*3+2]-C[j*3+2];
                d=sqrt(pow(dx,2)+pow(dy,2)+pow(dz,2));
                F[j*3+0]-=G*M[j]*M[k]/pow(d,3)*dx;
                F[j*3+1]-=G*M[j]*M[k]/pow(d,3)*dy;
                F[j*3+2]-=G*M[j]*M[k]/pow(d,3)*dz;
            }
        }
    }
    #pragma omp for schedule(dynamic)
    for(j=0;j<N;j++)
    {
        for(k=0;k<3;k++)
        {
            a=F[j*3+k]/M[j];
            V[j*3+k]=V[j*3+k]+a*Dt[i];
            C[j*3+k]=C[j*3+k]+V[j*3+k]*Dt[i];
        }
    }
}
What this code does: the outer loop is the number of time steps the simulation runs for, and it also indexes the Dt table at the end. The second loop iterates over each mass, and the third computes the forces exerted on it by all the other masses in the system. The two loops after that compute the new velocities and positions. With this in mind, I can't move the parallelism to the outer loop because every i iteration needs the updated C table from the previous one. So is there anything that can be changed so this code runs faster?
For more info about the problem:
loop: takes values between 10,000 and 1,000,000,000 (provided by the user)
N: takes values between 2 and 10 (provided by the user)
C: takes random values between min and max (provided by the user)
F and V: initial values 0.00
G: 6.673e-11
Allocation of the tables:
M=malloc(N*sizeof(int));
C=malloc(N*3*sizeof(float));
F=malloc(N*3*sizeof(float));
V=malloc(N*3*sizeof(float));
Dt=malloc(loop*sizeof(float));
Table values:
for(i=0;i<N;i++)
{
    M[i]=rand()%(high-low+1)+low;
}
for(i=0;i<N*3;i++)
{
    C[i]=rand()%(max-min+1)+min;
    F[i]=0.0;
    V[i]=0.0;
}
for(i=0;i<loop;i++)
{
    Dt[i]=(float)rand()/RAND_MAX;
}
You may start by replacing schedule(dynamic) with schedule(static). There is absolutely no need for dynamic scheduling here since the amount of work done by each iteration is constant. schedule(dynamic) defaults to a chunk size of 1 and dynamically assigns each iteration to some thread, with the associated huge overhead.
Dynamic scheduling is useful when each iteration involves a varying amount of work, in which case static scheduling may lead to load imbalance and idling threads. A canonical case is colouring a fractal set. Even then, it is often more reasonable to dispatch work items in chunks of more than one iteration in order to minimise the dispatch overhead.
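For illustration only, a generic sketch with hypothetical helper functions (not code from the question): uniform work per iteration calls for schedule(static), while something like fractal colouring, where iterations vary wildly in cost, benefits from schedule(dynamic) with a chunk size larger than one:
/* constant work per iteration: static scheduling, no per-iteration dispatch */
#pragma omp parallel for schedule(static)
for(j=0;j<N;j++)
    do_constant_work(j);       /* hypothetical helper */

/* highly variable work per iteration (e.g. colouring a fractal):
   dynamic scheduling, dispatched in chunks of 16 to amortise the overhead */
#pragma omp parallel for schedule(dynamic,16)
for(y=0;y<height;y++)
    colour_fractal_row(y);     /* hypothetical helper */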
The second loop is not running in parallel since what you have there is an orphaned OpenMP for construct instead of a combined parallel for one. You also need to make k and a private.
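A minimal sketch of that fix, applied to the second loop exactly as it stands in the question:
#pragma omp parallel for private(k,a) schedule(static)
for(j=0;j<N;j++)
{
    for(k=0;k<3;k++)
    {
        a=F[j*3+k]/M[j];
        V[j*3+k]=V[j*3+k]+a*Dt[i];
        C[j*3+k]=C[j*3+k]+V[j*3+k]*Dt[i];
    }
}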
Now that we know that N is really small and loop takes such large values, there are some things that can be done to improve the performance.
First, there is no such thing as "calling omp parallel once to start the parallelism". There are parallel regions that execute in parallel whenever the flow of control passes through them. A parallel region is a block of code following the OpenMP parallel construct. OpenMP worksharing constructs such as for only execute in parallel when inside the dynamic scope of a parallel region. The second loop is therefore not parallel since it is only a for construct and is not nested lexically or dynamically in a parallel region.
To make the terminology clear, lexical nesting of an OpenMP construct inside a parallel region means:
#pragma omp parallel
{
    #pragma omp for
    for (...) {}
}
and dynamic nesting means:
void foo()
{
    #pragma omp for
    for (...) {}
}

#pragma omp parallel
{
    foo();
}
Just calling foo() from outside a parallel region will not make the loop run in parallel. There are shorthand combined constructs such as parallel for for cases when the only code in the body of a parallel region is a single worksharing construct such as for.
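The shorthand and its expanded equivalent look like this (a generic sketch in the same placeholder style as above):
/* combined construct */
#pragma omp parallel for
for (...) {}

/* equivalent expanded form */
#pragma omp parallel
{
    #pragma omp for
    for (...) {}
}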
Second, parallel regions are not for free. OpenMP follows the fork/join model of parallel computation where the program executes sequentially until the flow of execution encounters a parallel region. When that happens, a fork of worker threads occurs and the program starts to execute in parallel. At the end of the parallel region, the worker threads are joined back into the main thread and the program continues to execute sequentially. Forking and joining have their price in terms of execution time. Although practically all modern OpenMP runtimes use thread pools and only the very first parallel region activation is really slow due to the time it takes the OS to spawn all the worker threads, the fork/join overhead is still not negligible. Therefore, it is meaningless to use OpenMP unless there is enough work to be distributed between the threads so that the overhead can be amortised.
Here is an illustration of the problem. Four iterations, each taking one time unit, computed sequentially and in parallel with two threads. The overhead for both fork and join is two time units:
| sequential parallel
| +------------+ +-------------------------+
| | it.0 | | fork |
| | it.1 | | overhead |
| | it.2 | | it.0 | it.2 |
| | it.3 | | it.1 | it.3 |
| +------------+ | join |
| | overhead |
| +-------------------------+
v time
Although dividing the iterations between two threads makes the computation twice as fast, the overhead makes the parallel version slower overall.
The same, but now with ten iterations:
| sequential parallel
| +------------+ +-------------------------+
| | it.0 | | fork |
| | it.1 | | overhead |
| | it.2 | | it.0 | it.5 |
| | it.3 | | it.1 | it.6 |
| | it.4 | | it.2 | it.7 |
| | it.5 | | it.3 | it.8 |
| | it.6 | | it.4 | it.9 |
| | it.7 | | join |
| | it.8 | | overhead |
| | it.9 | +-------------------------+
| +------------+
|
v time
Clearly, the parallel version is now faster, and it will get even faster the more iterations there are, asymptotically approaching a speedup of 2x from below. Note that the problem in the first case is not that there are only four iterations, but that those iterations take only one time unit each. It is fine to use OpenMP for a problem with a small number of iterations but a large amount of computational time per iteration.
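If you want a feel for the size of that overhead on your own system, here is a rough micro-benchmark sketch (not a rigorous measurement; compile with OpenMP enabled, e.g. gcc -fopenmp):
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int reps = 100000;
    double t = omp_get_wtime();
    for (int r = 0; r < reps; r++)
    {
        #pragma omp parallel
        {
            /* empty region: the only cost paid here is the fork/join overhead */
        }
    }
    t = omp_get_wtime() - t;
    printf("average fork/join overhead: %g us\n", t / reps * 1e6);
    return 0;
}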
The problem in the first case gets greatly exacerbated if the parallel region is inside an outer loop with many iterations, which is exactly your case. The canonical solution is to move the outer loop inside the parallel region. This way there will be a single fork and a single join, and the overhead will not get replicated. With your code, something like this:
#pragma omp parallel private(i,k,dx,dy,dz,d,a)
for(i=0;i<loop;i++)
{
    #pragma omp for schedule(static)
    for(j=0;j<N;j++)
    {
        for(k=0;k<N;k++)
        {
            if(j!=k)
            {
                dx=C[k*3+0]-C[j*3+0];
                dy=C[k*3+1]-C[j*3+1];
                dz=C[k*3+2]-C[j*3+2];
                d=sqrt(pow(dx,2)+pow(dy,2)+pow(dz,2));
                F[j*3+0]-=G*M[j]*M[k]/pow(d,3)*dx;
                F[j*3+1]-=G*M[j]*M[k]/pow(d,3)*dy;
                F[j*3+2]-=G*M[j]*M[k]/pow(d,3)*dz;
            }
        }
    }
    #pragma omp for schedule(static)
    for(j=0;j<N;j++)
    {
        for(k=0;k<3;k++)
        {
            a=F[j*3+k]/M[j];
            V[j*3+k]=V[j*3+k]+a*Dt[i];
            C[j*3+k]=C[j*3+k]+V[j*3+k]*Dt[i];
        }
    }
}
You have to be very careful now because the entire loop is inside the parallel region and each thread executes all of its iterations, i.e. there is no distribution of the i-loop iterations. There is no worksharing directive applied to the i-loop, and therefore i must explicitly be given the private treatment. A better coding style would have all private variables declared inside the parallel region, in which case there would be no need for a private clause at all, but this is not done here for demonstration reasons.
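That better style would look roughly like this (a sketch only; the loop bodies are the same as in the code above and are elided here):
#pragma omp parallel
{
    /* declared inside the parallel region, hence private to each thread;
       k, dx, dy, dz, d and a would be declared here as well */
    int i, j;

    for (i = 0; i < loop; i++)
    {
        #pragma omp for schedule(static)
        for (j = 0; j < N; j++)
        {
            /* same force computation as above */
        }

        #pragma omp for schedule(static)
        for (j = 0; j < N; j++)
        {
            /* same velocity and position update as above */
        }
    }
}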
Because the i-loop iterations are not independent from one another, you have to make sure that all threads are doing them in lock-step. This is usually achieved with barrier synchronisation, which in the code above comes from the implicit barriers at the end of the for constructs. The same applies to the different stages inside an iteration: here the second worksharing construct does not start before the previous one has finished, thanks to the implicit barrier at the end of the latter.
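For contrast, adding a nowait clause would remove that implicit barrier, which would be wrong here (a sketch of what not to do):
/* WRONG for this algorithm: without the implicit barrier, a thread could move
   on and start updating C[] while another thread is still reading C[] in its
   share of the force loop */
#pragma omp for schedule(static) nowait
for (j = 0; j < N; j++)
{
    /* force computation */
}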