
OpenMP shared and private variables


I am trying to parallelise an MD (molecular dynamics) code with OpenMP. However, even when I run with a single thread, the result is not correct. If I comment out the OpenMP part of the code, the result is correct. I think the problem is an incorrect use of shared and private variables in the omp pragma. Can anyone help me with this problem? Thank you very much! The code is below:

int i, j;
// parallelization using openmp
#pragma omp parallel shared(fx2,fy2,fz2,rx,ry,rz,N_molecular,L) private(i, j, xmin, ymin, zmin,rmin2, f)
{
    #pragma omp for // reduction(+:fx2,fy2,fz2)
    /*calculate all pair forces acting on each particles again*/
    for(i = 0; i < N_molecular-1; i++){
        for(j = i+1; j < N_molecular; j++){
            xmin = rx[i] - rx[j] - L*round ((rx[i] - rx[j])/L);
            ymin = ry[i] - ry[j] - L*round ((ry[i] - ry[j])/L);
            zmin = rz[i] - rz[j] - L*round ((rz[i] - rz[j])/L);
            rmin2 = xmin*xmin + ymin*ymin + zmin*zmin;
            if(rmin2 < rcut*rcut)
            {
                f = 48./pow(rmin2,7) - 24./pow(rmin2,4);
                //#pragma omp atomic update
                fx2[i] += f*xmin;
                //#pragma omp atomic update
                fy2[i] += f*ymin;
                //#pragma omp atomic update
                fz2[i] += f*zmin;
                //#pragma omp atomic update
                fx2[j] = fx2[j] - f*xmin;
                //#pragma omp atomic update
                fy2[j] = fy2[j] - f*ymin;
                //#pragma omp atomic update
                fz2[j] = fz2[j] - f*zmin;
            }
        }
    }
}

Solution

  • The tricky part in parallelising this code is making sure that the updates to the force arrays do not create race conditions. The updates done through the i index are safe, since the #pragma omp for directive guarantees that different threads handle different i indexes. However, the same is not true for the j index: two threads working on different i iterations can update fx2[j], fy2[j] and fz2[j] for the same j simultaneously, and that alone is enough to rule out this direct approach.

    A solution is to store the forces in local per-thread arrays, and to accumulate these partial results into the global arrays upon exiting the parallel region. Other possibilities exist, which I describe in (I hope) sufficient detail in this answer to a question very similar to yours.
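
    A minimal sketch of that per-thread accumulation idea, reusing the variable names from the question (N_molecular, L, rcut, rx/ry/rz, fx2/fy2/fz2 — their declarations are assumed to match the original code; the compute_forces wrapper function is only for illustration):

    ```c
    #include <math.h>
    #include <stdlib.h>

    void compute_forces(int N_molecular, double L, double rcut,
                        const double *rx, const double *ry, const double *rz,
                        double *fx2, double *fy2, double *fz2)
    {
        #pragma omp parallel
        {
            /* Each thread accumulates into its own private force arrays,
               so writes to index j cannot race between threads. */
            double *lfx = calloc(N_molecular, sizeof(double));
            double *lfy = calloc(N_molecular, sizeof(double));
            double *lfz = calloc(N_molecular, sizeof(double));

            /* dynamic scheduling balances the triangular i/j loop */
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < N_molecular - 1; i++) {
                for (int j = i + 1; j < N_molecular; j++) {
                    /* minimum-image convention, as in the question */
                    double xmin = rx[i] - rx[j] - L * round((rx[i] - rx[j]) / L);
                    double ymin = ry[i] - ry[j] - L * round((ry[i] - ry[j]) / L);
                    double zmin = rz[i] - rz[j] - L * round((rz[i] - rz[j]) / L);
                    double rmin2 = xmin*xmin + ymin*ymin + zmin*zmin;
                    if (rmin2 < rcut * rcut) {
                        double f = 48. / pow(rmin2, 7) - 24. / pow(rmin2, 4);
                        lfx[i] += f * xmin;  lfx[j] -= f * xmin;
                        lfy[i] += f * ymin;  lfy[j] -= f * ymin;
                        lfz[i] += f * zmin;  lfz[j] -= f * zmin;
                    }
                }
            }

            /* Merge the per-thread partial sums into the shared arrays.
               The critical section runs once per thread, not once per pair,
               so its cost is negligible. */
            #pragma omp critical
            for (int k = 0; k < N_molecular; k++) {
                fx2[k] += lfx[k];
                fy2[k] += lfy[k];
                fz2[k] += lfz[k];
            }
            free(lfx);
            free(lfy);
            free(lfz);
        }
    }
    ```

    Declaring the loop counters and scratch variables inside the loops also makes them automatically private, which sidesteps the shared/private bookkeeping from the original pragma.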