I am learning how to use OpenMP to make a code use multiple processors. Recently, I tried to parallelize the Fourier part of my Ewald summation using OpenMP. Below is the function named PME_Fourier_optimized which I am trying to parallelize with OpenMP pragmas, where
ParticleN = number of charged particles in the system (an integer)
kcount = number of reciprocal-space k-vectors (an integer)
qi = prop[i][0] (a double), the charge of particle i
Ext_L[0] = x dimension of the system (a double)
Ext_L[1] = y dimension of the system (a double)
Ext_L[2] = z dimension of the system (a double)
**r_pos = ParticleN x 3 array (r_pos[i][0], r_pos[i][1], and r_pos[i][2] = x, y, and z positions of particle i)
**PME_mvec_ksq = kcount x 4 array (the three k-vector components and the squared magnitude for each k)
**f_kewald = ParticleN x 3 array which stores the forces
double PME_Fourier_optimized(int kcount, double **r_pos, double **PME_mvec_ksq, double **prop, double **f_kewald, double *Ext_L){
    // Auto alpha determination //
    double TRTF = 50*5.5;
    double box_VOL = 8*Ext_L[0]*Ext_L[1]*Ext_L[2]; // calculate the volume of the box //
    alpha = pow((ParticleN*M_PI*M_PI*M_PI*TRTF)/(box_VOL*box_VOL),1./6.); // Ewald cutoff parameter //
    double GAMMA = -0.25/(alpha*alpha);
    double recip = 2*M_PI*ONE_PI_EPS0/(8*Ext_L[0]*Ext_L[1]*Ext_L[2]);
    double e_kewald = 0.0;
    double fx_kewald[ParticleN];
    double fy_kewald[ParticleN];
    double fz_kewald[ParticleN];

    #pragma omp parallel for
    for (int i=0; i<ParticleN; i++){
        fx_kewald[i] = 0.0;
        fy_kewald[i] = 0.0;
        fz_kewald[i] = 0.0;
        for (int j=0; j<dimen; j++){
            f_kewald[i][j] = 0.0;
        }
    }

    #pragma omp parallel for reduction(+:e_kewald,fx_kewald[:ParticleN],fy_kewald[:ParticleN],fz_kewald[:ParticleN])
    for(int k=0; k<kcount; k++){
        double ak_cos = 0.0;
        double ak_sin = 0.0;
        double mx  = PME_mvec_ksq[k][0];
        double my  = PME_mvec_ksq[k][1];
        double mz  = PME_mvec_ksq[k][2];
        double ksq = PME_mvec_ksq[k][3];

        #pragma omp parallel for reduction(+:ak_sin,ak_cos)
        for(int i=0; i<ParticleN; i++){
            double qi = prop[i][0]; // charge of particle i //
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            ak_sin -= qi*sin(kdotr);
            ak_cos += qi*cos(kdotr);
        }

        double a = ak_cos;
        double b = ak_sin;
        double akak = (a*a + b*b);
        double tmp = recip * exp(GAMMA*ksq)/ksq;

        #pragma omp parallel for reduction(+:fx_kewald[:ParticleN],fy_kewald[:ParticleN],fz_kewald[:ParticleN])
        for(int i=0; i<ParticleN; i++){
            double qi = prop[i][0];
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            double tmp2 = 2*tmp*qi*(sin(kdotr) * a + cos(kdotr) * b);
            // Edit3: Following @Laci's advice to use 1D arrays to store the forces and using the reduction
            // to gain a huge acceleration compared to only using omp critical (see the comment section for more details).
            fx_kewald[i] += tmp2 * mx;
            fy_kewald[i] += tmp2 * my;
            fz_kewald[i] += tmp2 * mz;
        }
        e_kewald += tmp * akak; // add the energies for each k //
    }

    // Edit3: Finally storing the calculated values in the 2D array //
    #pragma omp parallel for
    for(int i=0; i<ParticleN; i++){
        f_kewald[i][0] = fx_kewald[i];
        f_kewald[i][1] = fy_kewald[i];
        f_kewald[i][2] = fz_kewald[i];
    }
    //printf("PME KEwald: %lf\n", e_kewald);
    return e_kewald;
}
I compared this code's output with my serial version and noticed that the e_kewald energy and the f_kewald forces differ after 2 decimal places. Interestingly, I get different results each time I run it (maybe it is a data race).
When I remove the first pragma ( #pragma omp parallel for private(k, mx, my, mz, ksq, qi, kdotr, tmp2) reduction(+:e_kewald) ) before the outer k loop, it starts to work perfectly and matches the serial result exactly. Is there something that I am doing grossly wrong? I am also not sure if the way I used the pragmas is correct. Any help is highly appreciated.
Edit1: Using #pragma omp critical before the f_kewald update solves the problem, but I am not sure if it is the best way to do it.
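For reference, a rough sketch of what that critical-section version looked like; it replaces the second inner i loop of the function above, and mx, my, mz, tmp, a, and b are the per-k values computed earlier in that loop:
// Sketch of the Edit1 approach: every write to f_kewald is guarded by a
// critical section, so the result is correct but the updates are serialized.
for(int i=0; i<ParticleN; i++){
    double qi = prop[i][0];
    double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
    double tmp2 = 2*tmp*qi*(sin(kdotr) * a + cos(kdotr) * b);
    #pragma omp critical
    {
        f_kewald[i][0] += tmp2 * mx;
        f_kewald[i][1] += tmp2 * my;
        f_kewald[i][2] += tmp2 * mz;
    }
}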
My 2d arrays are defined using the following function:
double **alloc_2d_double(int rows, int cols){
    double *data = (double *)malloc(rows*cols*sizeof(double));
    double **array = (double **)malloc(rows*sizeof(double *));
    for (int i=0; i<rows; i++)
        array[i] = &(data[cols*i]);
    return array;
}
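For completeness, a short usage sketch; the free order assumes the single contiguous data block allocated above:
double **f_kewald = alloc_2d_double(ParticleN, 3);   // ParticleN x 3 force array
/* ... fill and use f_kewald[i][j] ... */
free(f_kewald[0]);   // releases the contiguous data block
free(f_kewald);      // releases the array of row pointers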
I am still wondering if f_kewald[i][0], f_kewald[i][1], f_kewald[i][2] can be used in a reduction.
Edit2: Declared the variables inside the loop and put braces around the omp critical section.
ParticleN ~ 5000-10000, kcount ~ 2000
Edit3: Using 1D arrays (fx_kewald, fy_kewald, fz_kewald) with reduction to gain enhanced parallel speed instead of an omp critical section with 2D arrays. (Thanks to @Laci's comment below.)
A simplified view of your code:
#pragma omp parallel for ...
for(k=0; k<kcount; k++){
    ...
    f_kewald[i][0] += ...
    ...
}
It's clear that different threads will update f_kewald[i][0] simultaneously, leading to errors. This happens for all i.
When you stuff all your code updating f_kewald into a critical section, this helps: now threads wait for each other. But still the updates are in arbitrary order, and the critical section doesn't discriminate between different values of i.
I suggest you parallelize your calculations only at one level: decide whether you want to do it at the outer loop or at the inner loops, not both. Assuming your ParticleN is big enough, you don't need to parallelize your outer loop. This is the easy case.
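A sketch of that easy case, using the names from your function (I have not tested it against your code, so treat it as a starting point): the k loop stays serial, only the particle loops are parallel, and each i is then touched by exactly one thread, so f_kewald needs no protection at all.
for(int k=0; k<kcount; k++){                  // outer loop stays serial
    double ak_cos = 0.0, ak_sin = 0.0;
    double mx  = PME_mvec_ksq[k][0];
    double my  = PME_mvec_ksq[k][1];
    double mz  = PME_mvec_ksq[k][2];
    double ksq = PME_mvec_ksq[k][3];

    #pragma omp parallel for reduction(+:ak_sin,ak_cos)
    for(int i=0; i<ParticleN; i++){
        double qi = prop[i][0];
        double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
        ak_sin -= qi*sin(kdotr);
        ak_cos += qi*cos(kdotr);
    }

    double a = ak_cos, b = ak_sin;
    double tmp = recip * exp(GAMMA*ksq)/ksq;

    #pragma omp parallel for
    for(int i=0; i<ParticleN; i++){
        double qi = prop[i][0];
        double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
        double tmp2 = 2*tmp*qi*(sin(kdotr) * a + cos(kdotr) * b);
        f_kewald[i][0] += tmp2 * mx;   // safe: each i is handled by one thread only
        f_kewald[i][1] += tmp2 * my;
        f_kewald[i][2] += tmp2 * mz;
    }

    e_kewald += tmp * (a*a + b*b);     // serial accumulation, no reduction needed
}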
If your ParticleN can be small, it is better to parallelize your outer loop, but then you have to fix the synchronization problem of f_kewald somehow. Some ideas: atomic updates on the force components, a critical section (as in your Edit1), or per-thread accumulation arrays combined at the end (essentially what your Edit3 reduction does).
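A sketch of the atomic variant (mx, my, mz, a, b, and tmp are computed per k exactly as in your function; I have not benchmarked this, and whether it is fast enough depends on how contended the updates are):
#pragma omp parallel for reduction(+:e_kewald)
for(int k=0; k<kcount; k++){
    /* ... compute mx, my, mz, a, b, tmp for this k as in your function ... */
    for(int i=0; i<ParticleN; i++){
        double qi = prop[i][0];
        double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
        double tmp2 = 2*tmp*qi*(sin(kdotr) * a + cos(kdotr) * b);
        #pragma omp atomic
        f_kewald[i][0] += tmp2 * mx;   // atomic protects concurrent writes from different k
        #pragma omp atomic
        f_kewald[i][1] += tmp2 * my;
        #pragma omp atomic
        f_kewald[i][2] += tmp2 * mz;
    }
    e_kewald += tmp * (a*a + b*b);     // scalar handled by the reduction clause
}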
This is a bad situation — you have to tweak your code, and it may be hard or impossible. If you can avoid it at all (see above), please do!
Regardless of anything, multiple people advised against using private to control the allocation of variables to threads. This is good advice! You should declare your variables in the proper C scope instead of using private. This will make it easier to transform your code (if you ever need to), and will prevent unpleasant surprises (when you forget just one variable in your giant private list).
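A tiny illustration of that point, reusing the names from your inner loop: anything declared inside the loop body is automatically private to each thread, so there is nothing to list in a private clause.
// Instead of declaring temporaries outside and listing them:
//   double qi, kdotr;
//   #pragma omp parallel for private(qi, kdotr) reduction(+:ak_sin,ak_cos)
// declare them where they are used; they are then private by construction:
#pragma omp parallel for reduction(+:ak_sin,ak_cos)
for(int i=0; i<ParticleN; i++){
    double qi = prop[i][0];                                          // thread-private
    double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2]; // thread-private
    ak_sin -= qi*sin(kdotr);
    ak_cos += qi*cos(kdotr);
}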