Tags: fortran, openmp, gnu

Parallel simulation gives different results after some time steps when compared with serial and additional parallel runs


I am trying to run a vortex simulation code in parallel using OpenMP. It is similar to a particle simulation: at each time step, the position of a vortex at the next step is computed from its velocity, which in turn is determined by the positions of all the other vortices at the current step. Vortices are deleted once they leave the domain. I compare the number of vortices at each time step between the parallel and serial versions of the code, and run each version multiple times.

For the serial runs, vortex counts match exactly at every time step. For the parallel case, all runs match the serial case for a few tens of time steps, after which each parallel run shows a difference but stays within a 7-10% error bound of the serial case (as can be seen in the result link below). I know this may be because of round-off errors in the parallel case, arising from the different order of computational steps when the work is distributed among threads, but should the error really be as high as 10%?

I have only used the reduction clause in a parallel do construct. The only parallel region in the whole code is inside a function vblob(), which sits in a module that I call from the main code. The only functions called within vblob(), ixi() and fxi(), are defined outside this module.

function vblob(blobs,xj,gj)
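    ! Note: ixi() and fxi() are defined outside this module; del, pi and j
    ! (used below) are assumed to come from the enclosing module scope.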
    complex(8), intent(in) :: blobs(:,:), xj
    complex(8) :: delxi, delxic, di, gvic, xi
    real(8), intent(in) :: gj
    real(8) :: vblob(2)
    integer :: p

    gvic = 0.0; delxi = 0.0; delxic = 0.0; di = 0.0; xi = 0.0
    !$omp parallel do private(xi,delxic,delxi,di) shared(xj) reduction(+:gvic)
    do p = 1, size(blobs,1)
      xi = ixi(blobs(p,1))
      delxic = xj-conjg(xi)
      delxi = xj-xi
      di = del*fxi(xi)
      gvic = gvic + real(blobs(p,2))*1/delxic
      if (abs(delxi) .gt. 1.E-4) then
        gvic = gvic +  (-1)*real(blobs(p,2))*1/delxi
      end if
    end do
    !$omp end parallel do
    gvic = j*gvic*fxi(xj)/(2*pi)
    vblob(1) = real(gvic)
    vblob(2) = -imag(gvic)

end function vblob

If the way I have constructed the parallel code were wrong, errors would show up from the very first time steps, right?

(As can be seen in this result, 'blobs' and 'sheets' are just types of vortex elements, and the blue line is the total number of elements. P and S stand for parallel and serial respectively, and R stands for runs. The solid plot markers are the serial code and the hollow ones are the three runs of the parallel code.)

EDIT: When I change the numerical precision of my variables to real(4) instead, the divergence in results happens at an earlier time step than in the real(8) case above. So it is clearly a round-off error issue.
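To show what I mean in isolation, here is a small standalone sketch (not part of my vortex code, just a plain harmonic sum) that adds the same terms in two different orders; the order-dependent difference is far larger in real(4) than in real(8):

program precision_roundoff
  implicit none
  integer, parameter :: n = 1000000
  real(4) :: s4_fwd, s4_bwd
  real(8) :: s8_fwd, s8_bwd
  integer :: i

  s4_fwd = 0.0;   s4_bwd = 0.0
  s8_fwd = 0.0d0; s8_bwd = 0.0d0

  ! forward order: 1, 1/2, 1/3, ...
  do i = 1, n
    s4_fwd = s4_fwd + 1.0/real(i)
    s8_fwd = s8_fwd + 1.0d0/real(i,8)
  end do

  ! reverse order: ..., 1/3, 1/2, 1
  do i = n, 1, -1
    s4_bwd = s4_bwd + 1.0/real(i)
    s8_bwd = s8_bwd + 1.0d0/real(i,8)
  end do

  print *, 'real(4) order-dependent difference: ', abs(s4_fwd - s4_bwd)
  print *, 'real(8) order-dependent difference: ', abs(s8_fwd - s8_bwd)
end program precision_roundoff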

TL;DR: Has anyone else seen such a result over a range of time steps, where the parallel code matches the serial code for the first few time steps and then diverges?


Solution

  • Your code essentially sums up a lot of terms in gvic. Floating-point arithmetic is not associative: (a+b)+c is not necessarily the same as a+(b+c) because of rounding. Also, depending on the values and the signs of the terms, each operation may incur a serious loss of precision. See here for essential reading on the subject.

    While the sequential loop computes (given no clever compiler optimisations):

    gvic = (...((((g_1 + g_2) + g_3) + g_4) + g_5) + ...)
    

    where g_i is the value added to gvic by iteration i, the parallel version computes:

    gvic = t_0 + t_1 + t_2 + ... t_(#threads-1)
    

    where t_i is the accumulated private value of gvic in thread i (threads in OpenMP are numbered from 0, even in Fortran). The order in which the different t_i values are reduced is unspecified; the OpenMP implementation is free to choose whatever order it deems fit. Even if all t_i values were summed in order, the result would still differ from the one computed by the sequential loop. Unstable numerical algorithms are exceptionally prone to producing different results when parallelised (see the small sketch at the end of this answer for the reduction-order effect in isolation).

    This is something you can hardly avoid completely; instead, you learn to control it or simply live with its consequences. In many cases the numerical solution to a problem is an approximation anyway, so you should focus on conserved or statistical properties. For example, an ergodic molecular dynamics simulation may produce a completely different phase trajectory when run in parallel, but values such as the total energy or the thermodynamic averages will be pretty close (unless there is a serious algorithmic error or really bad numerical instability).

    A side note: you are actually lucky to be entering this field now, when most CPUs use standard 32- and 64-bit floating-point arithmetic. Years ago, when x87 was a thing, floating-point operations were done with 80-bit internal precision, and the end result would depend on how many times a value left and re-entered the FPU registers.
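    To see the reduction-order effect in isolation, here is a minimal sketch (independent of your vortex code) that computes the same sum once sequentially and once with an OpenMP reduction(+:...). Compiled with gfortran -fopenmp, the two results typically differ in the last digits, and the parallel value can change with the number of threads:

    ! Minimal sketch: the same sum computed sequentially and via an OpenMP
    ! reduction. The grouping of the partial sums differs, so the results
    ! generally differ in the last few bits.
    program reduction_order
      use omp_lib
      implicit none
      integer, parameter :: n = 1000000
      real(8) :: s_seq, s_par
      integer :: i

      s_seq = 0.0d0
      do i = 1, n
        s_seq = s_seq + 1.0d0/real(i,8)
      end do

      s_par = 0.0d0
      !$omp parallel do reduction(+:s_par)
      do i = 1, n
        s_par = s_par + 1.0d0/real(i,8)
      end do
      !$omp end parallel do

      print *, 'threads            : ', omp_get_max_threads()
      print *, 'sequential sum     : ', s_seq
      print *, 'parallel sum       : ', s_par
      print *, 'absolute difference: ', abs(s_seq - s_par)
    end program reduction_order

    The difference here is tiny, but in a sensitive simulation such tiny differences get amplified over many time steps, which matches what you observe: agreement at first, divergence later.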