
Why does my parallel code using OpenMP atomic take longer than serial code?


The snippet of my serial code is shown below.

 Program main
  use omp_lib
  Implicit None
   
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0

  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

    Do i = 1, 100000000
      a = a + Real(i)
    End Do

  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

The elapsed time: (screenshot omitted)

Using the OpenMP do and atomic directives, I converted the serial code into parallel code. However, the parallel program runs slower than the serial program, and I don't understand why. Here is my parallel code snippet:

Program main
  use omp_lib
  Implicit None
    
  Integer, Parameter :: n_threads = 8
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0
 
  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

  !$OMP Parallel Num_threads(n_threads) shared(a)
  
   !$OMP Do 
     Do i = 1, 100000000
       !$OMP Atomic
       a = a + Real(i)
     End Do
   !$OMP End Do
  
  !$OMP End Parallel
  
  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

The elapsed time: (screenshot omitted)

So my question is: why does my parallel code using OpenMP atomic take longer than the serial code?


Solution

  • You are applying an atomic operation to the same variable in every single loop iteration, and that variable carries a dependency from one iteration to the next. Naturally, this brings additional overhead (e.g., synchronization, the cost of serialization, and extra CPU cycles) compared with the sequential version. Furthermore, you are probably getting a lot of cache misses because the threads keep invalidating each other's caches.

    This is exactly the kind of code that should use a reduction of the variable a (i.e., !$omp parallel do reduction(+:a)) instead of an atomic operation. With the reduction clause, each thread gets a private copy of the variable a, and at the end of the parallel region the threads reduce their copies (using the + operator) into a single value that is propagated back to the variable a of the main thread.
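    For reference, your loop rewritten with the reduction clause would look roughly like this (a sketch of the suggested fix, not benchmarked here; timing code kept as in your version):

    Program main
      use omp_lib
      Implicit None

      Integer :: i
      Real(8) :: t0, t1, a = 0.0d0

      !$ t0 = omp_get_wtime()

      ! No atomic directive: each thread accumulates into its own
      ! private copy of a, and OpenMP combines the copies with +
      ! once at the end of the parallel region.
      !$OMP Parallel Do Reduction(+:a)
      Do i = 1, 100000000
        a = a + Real(i)
      End Do
      !$OMP End Parallel Do

      !$ t1 = omp_get_wtime()

      Write (*,*) "a = ", a
      Write (*,*) "The wall time is ", t1-t0, "s"
    End Program main

    Because the synchronization now happens only once per thread instead of once per iteration, this version should scale with the number of threads rather than being dominated by contention on a.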

    You can find a more detailed answer about the differences between atomic and reduction in this SO thread. That thread even includes a code example whose atomic version, just like yours, is dramatically slower than its sequential counterpart (about 20x slower), which is even worse than your case (20x vs. your 10x).