
Make a reduction with OpenMP to compute the final summed value of a matrix element


I have the following double loop in which I compute the matrix element Fisher_M[FX][FY].

I tried to optimize it by adding the OpenMP pragma #pragma omp parallel for schedule(dynamic, num_threads), but the gain is not as good as expected.

Is there a way to do a reduction (of the sum) with OpenMP to compute the element Fisher_M[FX][FY] quickly? Or maybe this is doable with MAGMA or CUDA?

#define num_threads 8

#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<CO_CL_WL.size(); i++){
    for(int j=0; j<CO_CL_WL.size(); j++){
        if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
          Fisher_M[FX][FY] += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
        }
    }
}

Solution

  • Your code has a race condition on the line Fisher_M[FX][FY] += ...: all threads update the same element concurrently. A reduction solves it:

    double sum = 0;  // change the type as needed
    #pragma omp parallel for reduction(+:sum) 
    for(int i=0; i<CO_CL_WL.size(); i++){
        for(int j=0; j<CO_CL_WL.size(); j++){
            if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0){
              sum += CO_CL_WL[i][j]*CO_CL_WL_D[i][j];
            }
        }
    }
    Fisher_M[FX][FY] += sum;
    

    Note that this code is memory bound, not compute bound, so the performance gain from parallelization may be smaller than expected (and depends on your hardware).

    Ps: Why do you need the condition if( CO_CL_WL[i][j] != 0 || CO_CL_WL_D[i][j] != 0)? If either factor is zero, the sum does not change anyway. If you remove it, the compiler can generate much better vectorized code.
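    For illustration, here is a branch-free sketch of the loop with the guard dropped. fisher_sum is a hypothetical helper name, and the matrices are assumed to be std::vector<std::vector<double>> (the question does not show their actual type):

    ```cpp
    #include <vector>

    // Branch-free version of the reduction loop: zero entries contribute
    // nothing to the sum, so the if-guard can simply be removed, which
    // lets the compiler vectorize the inner loop.
    double fisher_sum(const std::vector<std::vector<double>>& a,
                      const std::vector<std::vector<double>>& b) {
        double sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < (int)a.size(); i++) {
            for (int j = 0; j < (int)a.size(); j++) {
                sum += a[i][j] * b[i][j];
            }
        }
        return sum;
    }
    ```

    Compiled without -fopenmp the pragma is simply ignored and the function runs serially, so the result is the same either way.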

    Ps2: In the schedule(dynamic, num_threads) clause the second parameter is the chunk size not the number of threads used. I suggest removing it in your your case. If you wish to specify the number of threads used, please add num_threads clause or use omp_set_num_threads function.
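    A minimal sketch of the corrected clause usage (the loop body is just a dummy counter to demonstrate the syntax):

    ```cpp
    #include <cstdio>

    // num_threads(8) is the clause that actually sets the thread count;
    // schedule(dynamic) without a second argument lets the runtime
    // choose the chunk size.
    int count_iterations(int n) {
        int count = 0;
        #pragma omp parallel for schedule(dynamic) num_threads(8) reduction(+:count)
        for (int i = 0; i < n; i++) {
            count += 1;
        }
        return count;
    }

    int main() {
        printf("count = %d\n", count_iterations(100));
        return 0;
    }
    ```

    Alternatively, call omp_set_num_threads(8) (declared in omp.h) once before the parallel region instead of using the clause.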