Tags: parallel-processing, synchronization, mpi, openmp, barrier

What is "implicit synchronization" in OpenMP


What exactly is "implicit synchronization" in OpenMP, and how can you spot one? My teacher said that

#pragma omp parallel
printf("Hello 1\n");

Has an implicit sync. Why? And how do you see it?


Solution

  • Synchronisation is an important issue in parallel processing and in OpenMP. In general, parallel processing is asynchronous: you know that several threads are working on a problem, but you have no way to know their actual state, which iteration they are working on, and so on. A synchronisation allows you to get control over thread execution.

    There are two kinds of synchronisations in OpenMP: explicit and implicit. An explicit synchronisation is done with a specific OpenMP construct that creates a barrier: #pragma omp barrier. A barrier is a point in the code that no thread may pass until all the threads have reached it. So after the barrier, you know exactly the state of all the threads and, more importantly, how much work they have done.
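    As a minimal sketch (my own example, not part of the question), every "phase 1" line below is printed before any "phase 2" line, because no thread may pass the explicit barrier until all of them have reached it:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
      #pragma omp parallel
      {
        printf("phase 1: thread %d\n", omp_get_thread_num());
        #pragma omp barrier  // explicit barrier: no thread continues until all have arrived
        printf("phase 2: thread %d\n", omp_get_thread_num());
      }
      return 0;
    }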

    Implicit synchronisation is done in two situations:

    • at the end of a parallel region. OpenMP relies on a fork-join model. When the program starts, a single thread (the master thread) is created. When you open a parallel section with #pragma omp parallel, several threads are created (fork). These threads work concurrently and are destroyed at the end of the parallel section (join). So at the end of a parallel section you have a synchronisation and you know precisely the status of all the threads (they have finished their work). This is what happens in the example that you give: the parallel section only contains the printf(), and at the end the program waits for the termination of all the threads before continuing (see the short sketch after this list).

    • at the end of some OpenMP constructs, such as #pragma omp for or #pragma omp sections, there is an implicit barrier. No thread can continue until all the threads have reached the barrier. This lets you know exactly what work has been done by the different threads.
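    To illustrate the first case, here is a minimal sketch of my own (the messages are only illustrative): the final printf() can only run after every thread has printed its message and the implicit join at the end of the parallel region has completed.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
      #pragma omp parallel
      printf("Hello from thread %d\n", omp_get_thread_num());
      // implicit synchronisation here: end of the parallel region (join)
      printf("All threads have finished\n"); // only the master thread runs this
      return 0;
    }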

    To illustrate the second case, consider the following code.

    #pragma omp parallel
    {
      #pragma omp for
      for(int i=0; i<N; i++)
        A[i]=f(i); // compute values for A
      #pragma omp for
      for(int j=0; j<N/2; j++)
        B[j]=A[j]+A[j+N/2];// use the previously computed vector A
    } // end of parallel section
    

    As all the threads work asynchronously, you do not know which threads have finished computing their part of vector A. Without a synchronisation, there is a risk that a thread quickly finishes its part of the first for loop, enters the second for loop and accesses elements of vector A while the threads that are supposed to compute them are still in the first loop and have not yet computed the corresponding values of A[i].

    This is the reason why OpenMP compilers add an implicit barrier to synchronise all the threads. So you are certain that all the threads have finished their work and that all values of A have been computed when the second for loop starts.

    But in some situations, no synchronisation is required. For instance, consider the following code:

    #pragma omp parallel
    {
      #pragma omp for
      for(int i=0; i<N; i++)
        A[i]=f(i); // compute values for A
      #pragma omp for
      for(int j=0; j<N/2; j++)
        B[j]=g(j);// compute values for B
    } // end of parallel section
    

    Obviously the two loops are completely independent: the second for loop does not need A to be computed before it starts. So the synchronisation adds nothing to the correctness of the program, and an unnecessary synchronisation barrier has two major drawbacks:

    1. If function f() has very different running times, some threads may have finished their work while others are still computing. The synchronisation forces the former threads to wait, and this idle time wastes the available parallelism.

    2. Synchronisations are expensive. A simple way to implement a barrier is to increment a global counter when reaching the barrier and to wait until the value of the counter equals the number of threads, omp_get_num_threads(). To avoid races between threads, the counter must be incremented with an atomic read-modify-write operation, which requires a large number of cycles, and waiting for the counter to reach the proper value is typically done with a spin lock that wastes processor cycles (see the sketch below).
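    As a rough illustration (a naive, one-shot sketch of mine, not how any actual OpenMP runtime implements its barriers), such a counting barrier could look like this:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <omp.h>

    static atomic_int arrived = 0;   // shared arrival counter

    // Naive one-shot counting barrier: each thread increments the counter,
    // then spins until every thread has arrived. Real runtimes are far more
    // sophisticated (reusable, hierarchical, with back-off), but the costs are
    // of the same nature: an atomic read-modify-write plus a spin wait.
    static void naive_barrier(int nthreads)
    {
      atomic_fetch_add(&arrived, 1);            // expensive atomic increment
      while (atomic_load(&arrived) < nthreads)  // spin lock: burns CPU cycles
        ;
    }

    int main(void)
    {
      #pragma omp parallel
      {
        printf("before barrier: thread %d\n", omp_get_thread_num());
        naive_barrier(omp_get_num_threads());
        printf("after barrier: thread %d\n", omp_get_thread_num());
      }
      return 0;
    }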

    So OpenMP provides a clause, nowait, to suppress these implicit synchronisations, and the best way to program the previous loop would be:

    #pragma omp parallel
    {
      #pragma omp for nowait  // nowait suppresses the implicit barrier at the end of this loop
      for(int i=0; i<N; i++)
        A[i]=f(i); // compute values for A
      #pragma omp for
      for(int j=0; j<N/2; j++)
        B[j]=g(j);// compute values for B
    } // end of parallel section
    

    This way, as soon as a thread has finished its work in the first loop, it immediately starts processing the second for loop and, depending on the actual program, this may significantly reduce the execution time.