c++multithreading performance parallel-processing openmp

OpenMP returning wrong result

The code below should use 4 threads to calculate the summation of all numbers between 0 and 1000. It should return 499500 but it is returning different values at each execution.

#include <omp.h>
#include <iostream>

using namespace std;

int main (int argc, char *argv[])
{
        int nthreads, i, tid;
        float total;

        #pragma omp parallel num_threads(4) 
        {
            tid = omp_get_thread_num();
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                cout << "Número de threads = " << nthreads <<endl;
            }
           #pragma omp barrier

           total = 0.0;
           #pragma omp for schedule(dynamic,10) private(i)
           for (i=0; i<1000; i++)
               total = total + i*1.0;
        } 
        cout << "Total = " <<total << endl;
        return 0;
}

Solution

What is happening in your code is that you have multiple threads concurrently modifying the value of the variable total. To solve this issue, you can use OpenMP reduction clause, which from the OpenMP standard one can read:

The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.

For a more detailed explanation on how the reduction clause works have a look a this SO Thread.

So to solve the race-condition in your code just changed it to:

        #pragma omp for schedule(dynamic,10) private(i) reduction(+:total)
        for (i=0; i<1000; i++)
             total = total + i*1.0;

After applying the reduction clause, to avoid wrong results, do not initialize the shared variable total within the parallel region:

total = 0.0;

simply set it before the parallel region.

There is another race condition regarding the variable tid itself, which is shared among threads and updated concurrently inside the parallel region:

tid = omp_get_thread_num();

This can be solve by making tid private to each thread for instance:

int tid = omp_get_thread_num();

Side Notes

In OpenMP the outermost loop enclosed by the #pragma omp for will have its index variable (i in this context) already private, therefore the clause private(i) is not really necessary.

Another point is the schedule(dynamic,10); unless you are just playing around, it actually makes more sense (performance-wise) to use the schedule static, since the code does not lead to loading balancing issues. The dynamic schedule has the additional overhead of assigning at run-time the tasks to threads, whereas the assignment in the static schedule is performed at compile-time.

Finally, the following :

    tid = omp_get_thread_num();
    if (tid == 0) {
        nthreads = omp_get_num_threads();
        cout << "Número de threads = " << nthreads <<endl;
    }

can be simplified by using the OpenMP master clause, namely:

    #pragma omp master
    {
        nthreads = omp_get_num_threads();
        cout << "Número de threads = " << nthreads <<endl;
    }

A running solution with the changes applied:

#include <omp.h>
#include <iostream>

using namespace std;

int main (int argc, char *argv[])
{
        float total = 0.0;

        #pragma omp parallel num_threads(4) 
        {
           #pragma omp master
           {
                int nthreads = omp_get_num_threads();
                cout << "Número de threads = " << nthreads <<endl;
           }
           #pragma omp barrier

           #pragma omp for reduction(+:total)
           for (int i=0; i < 1000; i++)
               total = total + i*1.0;
        } 
        cout << "Total = " <<total << endl;
        return 0;
}

Output:

Número de threads = 4
Total = 499500