I am very new to Fortran and am working through an exercise that involves summing numbers in parallel with OpenMP.
I have been told that the following code correctly calculates the sum in parallel:
!$omp parallel do private (I)
!$omp+ reduction(+:totals)
do I=1,100
totals = totals + localsum(I)
enddo
!$omp end parallel do
When I adapt the above code so that I can run it in my own Fortran program, I end up with:
Program test
implicit none
real totals
double precision, dimension (1 : 100) :: localsum
integer I
!$omp parallel do private (I)
!$omp+ reduction(+:totals)
do I=1,100
localsum(I)=I
totals = totals + localsum(I)
enddo
!$omp end parallel do
print *, 'The calculated sum total is', totals
end
This program returns
The calculated sum total is 5050.00000
However, I'm not sure why I needed to add the extra line
localsum(I)=I
when the original code didn't have this line. I notice that if I remove
!$omp+ reduction(+:totals)
Then
Program test
implicit none
real totals
double precision, dimension (1 : 100) :: localsum
integer I
!$omp parallel do private (I)
do I=1,100
localsum(I)=I
totals = totals + localsum(I)
enddo
!$omp end parallel do
print *, 'The calculated sum total is', totals
end
returns
The calculated sum total is 5050.00000
even though I expected the calculated total to be wrong. I thought that including the reduction clause, !$omp+ reduction(+:totals), was necessary to compute the correct total.
Is there an alternative way of adjusting the do loop to match the original code provided? I'm not sure why I had to change
do I=1,100
totals = totals + localsum(I)
enddo
to
do I=1,100
localsum(I)=I
totals = totals + localsum(I)
enddo
in order to calculate the sum.
The executed code is different with and without the !$omp+ reduction(+:totals) directive.
Without it, every thread directly updates the shared variable totals. This may happen to work (and in your example it did), but it is far from guaranteed: the problem is that it can lead to data races.
Assume that thread a and thread b both want to update this variable. Each of them needs:
1. to fetch the variable from memory
2. to update it in the processor
3. to write it back to memory
What will be the relative order of these operations in threads a and b? It is unspecified.
If the order is 1a2a3a1b2b3b, there is no problem.
If it is 1a1b2a2b3a3b, there is a problem: 1a1b (threads a and b fetch the same value), 2a2b (they update it more or less simultaneously), 3a3b (thread a writes its result back and it is immediately overwritten by thread b's value).
To avoid that, you can use atomic operations, which guarantee that the read-modify-write cycle cannot be interrupted. However, they are expensive and may significantly slow down the execution.
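For example, without the reduction you could make the shared update safe with an atomic directive. This is only a sketch of that approach, written as a drop-in replacement for the loop in your program; note that every iteration now pays the synchronization cost:
!$omp parallel do private (I)
do I=1,100
localsum(I)=I
! protect only the update of the shared variable:
!$omp atomic update
totals = totals + localsum(I)
enddo
!$omp end parallel do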
A much better way to avoid the race is to use a reduction. The clause !$omp+ reduction(+:totals) tells OpenMP to perform the sum in a safe and efficient way: each thread accumulates into its own private copy of totals, and at the end of the loop the private copies are combined into the shared variable with synchronized (atomic) updates, so races between threads are avoided. There are still atomic updates, but their number is reduced to one per thread, and the accumulation is mostly performed by fast local operations.
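To make this concrete, here is a hand-written sketch of roughly what the compiler does for reduction(+:totals); the helper variable partial is purely illustrative, OpenMP does not actually expose it:
program reduction_sketch
implicit none
double precision, dimension (1 : 100) :: localsum
double precision totals, partial
integer I
do I=1,100
localsum(I)=I
enddo
totals = 0.d0
!$omp parallel private (partial)
partial = 0.d0                    ! each thread starts its own private sum
!$omp do private (I)
do I=1,100
partial = partial + localsum(I)   ! fast, purely thread-local accumulation
enddo
!$omp end do
!$omp atomic update
totals = totals + partial         ! one synchronized update per thread
!$omp end parallel
print *, 'The calculated sum total is', totals
end
The reduction clause does essentially this for you, possibly in an even more optimized way, so you normally never write it by hand.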
Concerning the usefulness of the line localsum(I)=I: it is required if the vector localsum has not been initialized beforehand. But if the goal is just to add up the first 100 integers, you can simply use
do I=1,100
totals = totals + I
enddo
The performance will be better and the result identical, and both loops are parallelized in the same way.
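For instance, a complete version of the simplified program could look like the sketch below; the directive is written on a single line so no continuation is needed, and totals is explicitly initialized instead of relying on the compiler:
program test
implicit none
real totals
integer I
totals = 0.0   ! the reduction adds the per-thread sums to this initial value
!$omp parallel do private (I) reduction(+:totals)
do I=1,100
totals = totals + I   ! no intermediate array needed
enddo
!$omp end parallel do
print *, 'The calculated sum total is', totals
end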