Tags: c, pointers, for-loop, openmp, reduction

OpenMP Dot Product and Pointers


I'm trying to implement a dot product in OpenMP with large arrays allocated with malloc. However, when I use reduction(+:result) it produces a different result on each program run. Why do I get different results? How can I remedy that? And how can this example be optimized? Here's my code:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <omp.h>

const int N = 1e1;

int main ()
{
  int    i, nthreads, tid;
  double x_seq, x_par, *y, *z, cpu_time_used;
  clock_t start, end;

  y = (double*)malloc(sizeof(double)*N);
  z = (double*)malloc(sizeof(double)*N);

  for (i=0; i<N; i++) {
      y[i] = i * 1.0;
      z[i] = i * 2.0;
  }

  x_seq = 0;
  x_par = 0;
  for (i=0; i<N; i++) x_seq += y[i] * z[i];

  #pragma omp parallel shared(y, z) private(i, tid)
  {
      #pragma omp single
      {
          nthreads = omp_get_num_threads();
      }
      tid = omp_get_thread_num();

      #pragma omp parallel for reduction(+:x_par)
      for (i=tid; i<N; i+=nthreads)
      {
          x_par += y[i] * z[i];
      }

  }
  return 0;
}

Solution

  • There are several things wrong here.

    Let's look at the loop as it stands:

    #pragma omp parallel shared(y, z) private(i, tid)
    {
      #pragma omp single
      {
          nthreads = omp_get_num_threads();
      }
      tid = omp_get_thread_num();
    
      #pragma omp parallel for reduction(+:x_par)
      for (i=tid; i<N; i+=nthreads)
      {
          x_par += y[i] * z[i];
      }
    }
    

    First, notice that you (presumably) want x_par to be accessible outside of this region, so you'll want the reduction(+:x_par) on the outer section, not the inner. If you also add the very helpful default(none) clause, you'll find that there's no clause describing the sharing of nthreads; let's explicitly make that shared.

    So let's look again:

    #pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
    {
      #pragma omp single
      {
          nthreads = omp_get_num_threads();
      }
      tid = omp_get_thread_num();
    
      #pragma omp parallel for 
      for (i=tid; i<N; i+=nthreads)
      {
          x_par += y[i] * z[i];
      }
    }
    

    So looking more carefully, we now see that you have two omp parallel sections. That means that, if nested parallelism is enabled, you'll have nthreads tasks each launching nthreads tasks to do that loop; so the loop would end up with nthreads times the right answer if everything worked. So let's get rid of the parallel, just using the for:

     #pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
    {
      #pragma omp single
      {
          nthreads = omp_get_num_threads();
      }
      tid = omp_get_thread_num();
    
      #pragma omp for 
      for (i=tid; i<N; i+=nthreads)
      {
          x_par += y[i] * z[i];
      }
    }
    

    So that has the sharing correct and isn't nesting parallelism, but it still doesn't give the right answer; it gives a much smaller result. What's wrong? Let's take a look at the for loop. Each thread wants to start at tid and stride by nthreads, which is fine; but why do we want the omp for there?

    Let's take a look at a simpler version which works:

    #pragma omp parallel shared(y, z) reduction(+:x_par) default(none)
    {
      #pragma omp for
      for (i=0; i<N; i++)
      {
          x_par += y[i] * z[i];
      }
    }
    

    Note that here we aren't decomposing the loop explicitly with tid and nthreads - we don't have to, because omp for decomposes the loop for us; it assigns loop iterations to threads.
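
    In fact, the parallel and the for can be fused into a single combined construct. A minimal sketch of that form - assuming, as above, that x_par has been zeroed beforehand (the loop index of a parallel for is made private automatically) - is:

    #pragma omp parallel for reduction(+:x_par)
    for (i=0; i<N; i++)
    {
        x_par += y[i] * z[i];
    }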

    So looking back at what we have, we have a manual decomposition of the loop - which is fine, sometimes that's what you need to do; and an omp for which tries to take that loop and split it up amongst threads. But we're already doing that; the omp for splits each thread's already-strided loop a second time, so each thread executes only a fraction of its own iterations and most indices are never computed by anyone. That's why the result comes out much smaller.

    So getting rid of the omp for

    #pragma omp parallel shared(y, z, nthreads) private(i, tid) reduction(+:x_par) default(none)
    {
      #pragma omp single
      {
          nthreads = omp_get_num_threads();
      }
      tid = omp_get_thread_num();
    
      for (i=tid; i<N; i+=nthreads)
      {
          x_par += y[i] * z[i];
      }
    }
    

    Gets us the right answer.
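
    For completeness, here's a sketch of the whole program with that fix folded in; it also prints both sums so they can be compared, and frees the arrays. The printing and cleanup are additions for illustration, and N is moved into main as a plain int and listed in the shared clause so that default(none) is satisfied regardless of how the compiler treats const globals:

    #include <stdlib.h>
    #include <stdio.h>
    #include <omp.h>

    int main ()
    {
      int    i, nthreads, tid;
      int    N = 10;   /* plain local instead of the const global, so it can be listed under default(none) */
      double x_seq, x_par, *y, *z;

      y = (double*)malloc(sizeof(double)*N);
      z = (double*)malloc(sizeof(double)*N);

      for (i=0; i<N; i++) {
          y[i] = i * 1.0;
          z[i] = i * 2.0;
      }

      /* sequential reference result */
      x_seq = 0;
      for (i=0; i<N; i++) x_seq += y[i] * z[i];

      /* parallel version: reduction on the outer region, manual decomposition, no inner omp for */
      x_par = 0;
      #pragma omp parallel shared(y, z, nthreads, N) private(i, tid) reduction(+:x_par) default(none)
      {
          #pragma omp single
          {
              nthreads = omp_get_num_threads();
          }
          tid = omp_get_thread_num();

          for (i=tid; i<N; i+=nthreads)
          {
              x_par += y[i] * z[i];
          }
      }

      printf("sequential: %f   parallel: %f\n", x_seq, x_par);

      free(y);
      free(z);
      return 0;
    }

    Compiled with, e.g., gcc -fopenmp, this gives the same answer on every run; any tiny difference from the sequential sum would just be floating-point rounding, since the reduction adds the terms in a different order.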